WILEY PUBLICATIONS IN STATISTICS


Walter A. Shewhart, Samuel S. Wilks
Editors 




Mathematical Statistics 
ALEXANDER - Elements of Mathematical Statistics 
ANDERSON - An Introduction to Multivariate Statistical Analysis
BLACKWELL and GIRSHICK - Theory of Games and Statistical Decisions 
CRAMER - The Elements of Probability Theory and Some of Its Applications 
DOOB - Stochastic Processes 
DWYER - Linear Computations 
FELLER - An Introduction to Probability Theory and Its Applications, Volume I,
Second Edition
FISHER - Contributions to Mathematical Statistics
FRASER - Nonparametric Methods in Statistics 
FRASER - Statistics—An Introduction 
GRENANDER - Probability and Statistics 
GRENANDER and ROSENBLATT - Statistical Analysis of Stationary Time Series 
HANSEN, HURWITZ, and MADOW - Sample Survey Methods and Theory, Volume II 
HOEL - Introduction to Mathematical Statistics, Third Edition 
KEMPTHORNE - The Design and Analysis of Experiments
KULLBACK - Information Theory and Statistics
LEHMANN - Testing Statistical Hypotheses 
PARZEN - Modern Probability Theory and Its Applications 
RAO - Advanced Statistical Methods in Biometric Research
RIORDAN - An Introduction to Combinatorial Analysis 
SAVAGE - Foundations of Statistics 
SCHEFFE - The Analysis of Variance 
WALD - Sequential Analysis
WALD - Statistical Decision Functions 
WILKS - Mathematical Statistics


Applied Statistics 
ACTON - Analysis of Straight-Line Data
BENNETT and FRANKLIN - Statistical Analysis in Chemistry and the Chemical Industry 
BROWNLEE - Statistical Theory and Methodology in Science and Engineering 
BUSH and MOSTELLER - Stochastic Models for Learning 
CHEW - Experimental Designs in Industry
CLARK - An Introduction to Statistics 
COCHRAN - Sampling Techniques, Second Edition 
COCHRAN and COX - Experimental Designs, Second Edition 
CORNELL - The Essentials of Educational Statistics 
COX - Planning of Experiments 
DEMING - Sample Design in Business Research 

DEMING - Some Theory of Sampling

DODGE and ROMIG - Sampling Inspection Tables, Second Edition 
FRYER - Elements of Statistics
GOULDEN - Methods of Statistical Analysis, Second Edition 
HALD - Statistical Tables and Formulas 
HALD - Statistical Theory with Engineering Applications 


HANSEN, HURWITZ, and MADOW - Sample Survey Methods and Theory, Volume I
HOEL - Elementary Statistics 

KEMPTHORNE - An Introduction to Genetic Statistics 

MEYER - Symposium on Monte Carlo Methods 

MUDGETT - Index Numbers 

RICE - Control Charts 

ROMIG - 50-100 Binomial Tables 

SARHAN and GREENBERG - Contributions to Order Statistics 
TIPPETT - Technological Applications of Statistics

WILLIAMS - Regression Analysis 

WOLD and JUREEN - Demand Analysis 

YOUDEN - Statistical Methods for Chemists 


Books of Related Interest 


ALLEN and ELY - International Trade Statistics
ARLEY and BUCH - Introduction to the Theory of Probability and Statistics 
CHERNOFF and MOSES - Elementary Decision Theory 

HAUSER and LEONARD - Government Statistics for Business Use, Second Edition
STEPHAN and McCARTHY - Sampling Opinions—An Analysis of Survey Procedures 


Stochastic 


Processes 




J. L. DOOB 


Professor of Mathematics 
University of Illinois


New York - John Wiley & Sons, Inc. 


London 


COPYRIGHT, 1953

JOHN WILEY & SONS, INC.


All Rights Reserved 
This book or any part thereof must not
be reproduced in any form without
the written permission of the publisher.

COPYRIGHT, CANADA, 1953, INTERNATIONAL COPYRIGHT, 1953
JOHN WILEY & SONS, INC., PROPRIETORS


All Foreign Rights Reserved
Reproduction in whole or in part forbidden.


FOURTH PRINTING, OCTOBER, 1962 


Library of Congress Catalog Card Number: 52-11857 


PRINTED IN THE UNITED STATES OF AMERICA 




Preface 


A STOCHASTIC PROCESS IS THE MATHEMATICAL ABSTRACTION OF 
an empirical process whose development is governed by probabilistic 
laws. The theory of stochastic processes has developed so much in the 
last twenty years that the need of a systematic account of the subject 
has been strongly felt by students of probability, and the present book 
is an attempt to fill this need. The reader is warned that this book does 
not cover the subject completely, and that it stresses most those parts 
of the subject which appeal most to me. 

Although it would be absurd to write a book on stochastic processes 
which does not assume a considerable background in probability on 
the part of the reader, there is unfortunately as yet no single text which 
can be used as a standard reference. To compensate somewhat for this 
lack, elementary definitions and theorems are stated in detail, but there 
has been no attempt to make this book a first text in probability, even 
for the most advanced mathematical reader. 

There has been no compromise with the mathematics of probability. 
Probability is simply a branch of measure theory, with its own special 
emphasis and field of application, and no attempt has been made to 
sugar-coat this fact. Using various ingenious devices, one can drop 
the interpretation of sample sequences and functions as ordinary
sequences and functions, and treat probability theory as the study of
systems of distribution functions. For a very short time this was the 
only known way to treat certain parts of the subject (such as the strong 
law of large numbers) rigorously. However, such a treatment is no 
longer necessary and results in a spurious simplification of some parts 
of the subject, and a genuine distortion of all of it. 

There is probably no mathematical subject which shares with probability
the features that on the one hand many of its most elementary
theorems are based on rather deep mathematics, and that on the other 
hand many of its most advanced theorems are known and understood 
by many (statisticians and others) without the mathematical background
to understand their proofs. It follows that any rigorous treatment
of probability is threatened with the charge that it is absurdly
overloaded with extraneous mathematics. Against this charge I have
an easy defense. Early versions of the book contained many attempts
to evade various mathematical points and thereby make the book easier 
to read. The readers complained that the evasions only increased the 
obscurity. The evasions were therefore eliminated, and the reader of 
this final version can now estimate the clarity of the earlier versions. 
The additional mathematical discussion made advisable a supplement 
on measure theory. In the Supplement is included a treatment of various
aspects of measure theory with which the ordinary reader may not
be familiar. 

References to the literature and historical remarks have all been
collected in the Appendix. Apologies are offered in advance to those who
feel that they have been slighted in this Appendix, together with the 
assurance that they will probably have lots of company, and that no 
slight is intentional. The articles and books referred to in the Appendix 
are collected in a separate Bibliography. 

Although much of the material in the book has been reorganized 
as seemed best to fit the point of view of the book, and other material 
is new, it is stressed that, even where no reference is given, no result 
is to be considered new unless it is stated as such in the Appendix. 
Subsidiary results are usually not credited to anyone. 

Chapter XII, on prediction theory, is somewhat out of place in the 
book, since it discusses a rather specialized problem. It was put in 
because of the importance of the subject matter, and because of the 
lack of material on prediction theory, in the usual language of
probability, readily available to the American reader. I had the benefit of
stimulating conversations with Norbert Wiener on this subject. 

I take this opportunity to thank a small group of friendly readers 
and savage critics. The advice of William Feller was particularly
important in shaping the organization of the book. The criticism given
by him, by Kai Lai Chung, and by J. L. Snell was always prized even 
when not accepted. Finally, thanks are due to Kathryn Hollenbach 
who typed most of the manuscript. The Office of Naval Research partly 
subsidized me during a year of my work on the preparation of the 
manuscript. 

J. L. DOOB
University of Illinois 
September, 1952 


Contents 


I. INTRODUCTION AND PROBABILITY BACKGROUND
II. DEFINITION OF A STOCHASTIC PROCESS—PRINCIPAL CLASSES
III. PROCESSES WITH MUTUALLY INDEPENDENT RANDOM VARIABLES
IV. PROCESSES WITH MUTUALLY UNCORRELATED OR ORTHOGONAL RANDOM VARIABLES
V. MARKOV PROCESSES—DISCRETE PARAMETER
VI. MARKOV PROCESSES—CONTINUOUS PARAMETER
VII. MARTINGALES
VIII. PROCESSES WITH INDEPENDENT INCREMENTS
IX. PROCESSES WITH ORTHOGONAL INCREMENTS
X. STATIONARY PROCESSES—DISCRETE PARAMETER
XI. STATIONARY PROCESSES—CONTINUOUS PARAMETER
XII. LINEAR LEAST SQUARES PREDICTION—STATIONARY (WIDE SENSE) PROCESSES
SUPPLEMENT
APPENDIX
BIBLIOGRAPHY
INDEX


CHAPTER I


Introduction and 


Probability Background 


1. Mathematical prerequisites 

Although advanced mathematical methods are used in this book, the
results will always be stated in probability language, and it is hoped that
the book will be accessible to readers thoroughly familiar with the
manipulation of random variables, conditional probability distributions,
and conditional expectations. The necessary background will be outlined
in this chapter.

There is an unavoidable dilemma confronting the authors of advanced
probability books. Probability theory is but one aspect of the theory of
measure, with special emphasis on certain problems. An advanced
probability book must, therefore, either include a section devoted to measure
theory or must assume knowledge of measure theoretic facts as given in
scattered papers. The second alternative was chosen originally, to make
the book less formidable, but pressure of critics and logic forced a
compromise which is as inconsistent as most compromises. It is hoped that
the book remains in part at least accessible to statisticians and others who
are not professional mathematicians but who are familiar with the formal
manipulations of probability theory.


2. The basic space 

Now that probability theory has become an acceptable part of
mathematics, words like “occurrence,” “event,” “urn,” “die,” and so on can
be dispensed with. These words and the ideas they represent, however,
still add intuitive significance to the subject and suggest analytical methods
and new investigations. It is for this reason that probability language is
still used even in purely theoretical investigations.

In the applications, probability is concerned with the occurrence of
events, such as the turning up of a 5 on a die, a displacement in a given
direction of $x(t)$ centimeters in $t$ seconds by a particle in a Brownian
movement, and so on. Probability numbers are assigned to such events.
For example, the number $1/6$ is usually assigned to the turning up of a
5 on a die and $1/2$ to the turning up of an even number; in the Brownian
movement example, the inequality $x(3) > 7$ is assigned a probability
according to rules to be described in detail in VIII. For the die, the
purely mathematical analysis goes as follows: Each possible result, that
is, each one of the integers $1, \dots, 6$, is assigned the probability number
$1/6$; any class of $n$ of these results is assigned the number $n/6$. The
statement that the probability of getting an even number is $1/2 = 3/6$ is then
simply the evaluation for this special case when $n = 3$. In the Brownian
movement case the situation is more complicated and its analysis will be
deferred for the present, but in this case also the question becomes that
of assigning numbers to the mathematical abstractions of events.
Throughout this book a point set is taken as the mathematical abstraction
of an event.
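
The assignment just described for the die can be sketched in a few lines of Python. This is only an illustration: the names `OUTCOMES` and `probability` are assumptions introduced here, and an event is represented, as in the text, by a set of points.

```python
from fractions import Fraction

# The six possible results of one toss; each is assigned probability 1/6.
OUTCOMES = set(range(1, 7))

def probability(event):
    """Probability of a class of results: (number of results in it) / 6."""
    results = set(event) & OUTCOMES
    return Fraction(len(results), 6)

print(probability({5}))        # the event "a 5 turns up"
print(probability({2, 4, 6}))  # the event "an even number turns up", n = 3
```

Exact fractions are used so that the evaluation $3/6 = 1/2$ is reproduced without rounding.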

The theory of probability is concerned with the measure properties of 
various spaces, and with the mutual relations of measurable functions 
defined on those spaces. Because of the applications, it is frequently 
(although not always) appropriate to call these spaces sample spaces and 
their measurable sets events, and these terms should be borne in mind in 
applying the results. 

The following is the precise mathematical setting. (See the Supplement
at the end of the book for a treatment of the concepts of field and measure
suitable for this book.) It is supposed that there is some basic space $\Omega$,
and a certain basic collection of sets of points of $\Omega$. These sets will be
called measurable sets; it is supposed that the class of measurable sets is
a Borel field. It is supposed that there is a function $P\{\cdot\}$, defined for all
measurable sets, which is a probability measure, that is, $P\{\cdot\}$ is a completely
additive non-negative set function, with value 1 on the whole space. The
number $P\{A\}$ will be called the probability or measure of $A$. The
measurability of an $\omega$ function is defined in the usual way in terms of
the measurable sets, and the integral with respect to the probability
measure of a measurable function $\varphi$ over a measurable set $A$ will be
denoted by

\[ \int_A \varphi(\omega)\,dP \quad\text{or}\quad \int_A \varphi\,dP. \]

A property true at all points of $\Omega$ except at those of a set of probability
0 will be said to be true almost everywhere, or true at almost all points $\omega$,
or true with probability 1.

For example, consider the analysis of the tossing (once) of a die. In
this case the simplest suitable space $\Omega$ consists of the six points $1, \dots, 6$,
with the identification of the point $j$ as the event the number $j$ turns up
when the die is tossed. Every $\Omega$ set is measurable in this case and is
assigned as measure one-sixth the number of its points. In this case the
space $\Omega$ is certainly appropriately described as sample space, because of
the simple correspondence between its points and the possible outcomes
of the experiment. Now consider a second mathematical model of this
same experiment. In this model the space $\Omega$ consists of all real numbers
and the $\Omega$ points $1, \dots, 6$ are identified with the same events as above.
No other identifications are made. All $\Omega$ point sets are again measurable
and the measure of any point set is one-sixth the number of the points
$1, \dots, 6$ which lie in the set. This mathematical model is practically
the same as the preceding one, but the name “sample space” for $\Omega$ is
somewhat less appropriate because the correspondence between $\Omega$ points
and events is less simple. Finally consider a third mathematical model
of this same experiment. In this model the space $\Omega$ consists of all the
numbers in the semi-closed interval $[1, 7)$ and this time the interval
$[j, j + 1)$ is identified with the event the number $j$ turns up when the die is
tossed. The measurable point sets in this case are the intervals $[j, j + 1)$
or unions of these intervals, and the measure of any measurable set is
one-sixth the number of these intervals in it. This model is just as usable
as the two preceding ones, but the name “sample space” is certainly not
appropriate for $\Omega$. The natural objection that the last two models,
particularly the last one, are unsuitable, and should be excluded because
they are needlessly complicated, is easily refuted. In fact “needlessly
complicated” models cannot be excluded, even in practical studies. How
a model is set up depends on what parameters are taken as basic. In the
first model above the actual outcomes of the toss were taken as basic.
If, however, the die is thought of as a cube to be tossed into the air and
let fall, the position of the cube in space is determined by six parameters,
and the most natural space $\Omega$ might well be considered to be the
twelve-dimensional space of initial positions and velocities. A point of $\Omega$
determines how the die will land. Let $A_j$ be the set of those $\Omega$ points
which give rise to the outcome $j$. Then the usual hypothesis made is that
the assignment of probabilities to $\Omega$ sets assigns probability $1/6$ to each
$A_j$. If we are only interested in the probabilities of the results of the
experiment, this is all we need to know about $\Omega$ probabilities, and we
then have a mathematical model similar to our third one above. The
point is that both in practical and in theoretical discussions, even if
initially the space $\Omega$ is chosen to fit some criterion of simplicity (say that
the name “sample space” be appropriate), this criterion is likely to be
lost in the course of a discussion in which we get distributions derived
in some way from the initial one. In accordance with this fact we have
imposed no condition whatever on the space $\Omega$ not implicit in the existence
of the set function $P\{A\}$, and the existence of this set function prevents $\Omega$
from being empty but is not otherwise restrictive. In the course of this
book $\Omega$ is usually simply an abstract space. In some cases, however, $\Omega$
will be taken to be the interval $0 < x < 1$, the space of all finite or infinite
valued real functions of $t$, $-\infty < t < \infty$, the finite plane and many other
spaces. The only specifically probabilistic hypothesis imposed on the
measure function $P\{A\}$ is the normalizing condition $P\{\Omega\} = 1$. Outside
the field of probability there is frequently no reason to restrict measure
functions to be finite-valued, and if finite-valued there is no reason to
restrict the value on the whole space to be 1.
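
The three models of a single toss can be compared in a short Python sketch. The function names are hypothetical, and in the third model a measurable set is represented, as an assumption of this sketch, by the left endpoints $j$ of the intervals $[j, j + 1)$ it contains.

```python
from fractions import Fraction

SIXTH = Fraction(1, 6)
FACES = (1, 2, 3, 4, 5, 6)

def measure_model_1(points):
    """Model 1: the space is {1, ..., 6}; measure = (number of points) / 6."""
    points = set(points)
    if not points <= set(FACES):
        raise ValueError("model 1 contains only the points 1, ..., 6")
    return SIXTH * len(points)

def measure_model_2(points):
    """Model 2: the space is all real numbers; only 1, ..., 6 carry measure."""
    return SIXTH * len({p for p in points if p in FACES})

def measure_model_3(left_endpoints):
    """Model 3: the space is [1, 7); a measurable set is a union of
    intervals [j, j + 1), given here by the left endpoints j."""
    return SIXTH * len({j for j in left_endpoints if j in FACES})

# The event "an even number turns up" in each of the three models.
print(measure_model_1({2, 4, 6}))
print(measure_model_2({2, 4, 6, 2.5, -1.0}))  # extra reals add no measure
print(measure_model_3({2, 4, 6}))             # [2,3), [4,5), [6,7)
```

All three assign the same probability to the same event, which is the sense in which the models are interchangeable.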

We now analyze the tossing of a die still further, to illustrate the
importance of product spaces as the basic spaces. In analyzing the tossing
of a die an unlimited number of times, the natural basic space $\Omega$ is the
obvious sample space, defined as follows. Each point $\omega$ of $\Omega$ is a sequence
$\xi_1, \xi_2, \dots$, where $\xi_j$ is one of the integers $1, \dots, 6$. That is, each point
of $\Omega$ is one of the conceptually possible outcomes of an experiment in
which a die is tossed infinitely often. If the class of all sequences
beginning with $j$ is identified with the event the number $j$ turns up the first time
the die is tossed, and if this class is assigned the probability $1/6$ we have
still another mathematical model of a single toss. The advantage of this
model over those described above is that this model is adaptable to any
number of tosses if the class of all sequences beginning with $j_1, \dots, j_n$
is identified with the event the numbers $j_1, \dots, j_n$ turn up in the first $n$
tosses. With the usual hypotheses, this class of sequences, that is, this
$\Omega$ set, is assigned the measure $(1/6)^n$. While it is true that in many
applications spaces of infinite sequences of this sort are unnecessary,
because only a fixed finite number of experiments is to be discussed, the
infinite sequences cannot be avoided in some problems of very simple
character. For example, the analysis of the first time an event occurs
(say the first 6 in a succession of throws of a die) cannot be done in a
satisfactory manner without the space of sample infinite sequences,
because the number of trials before the event occurs will be a number
which cannot be bounded in advance. This number is an unbounded
integral valued function of $\omega$. In this die example $\Omega$ is a product space,
one with infinitely many factor spaces each of which contains six points.
The natural sample space for a repeated trial is always a product space.
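
A brief Python sketch may make the last point concrete. It computes the measure $(1/6)^n$ of the class of sequences beginning with a given $j_1, \dots, j_n$, and from it the probability that the first 6 occurs on toss $k$. The function names are assumptions of this sketch.

```python
from fractions import Fraction

def cylinder_measure(prefix):
    """Measure of the class of all sequences beginning with `prefix`."""
    if not all(j in range(1, 7) for j in prefix):
        raise ValueError("each entry must be one of 1, ..., 6")
    return Fraction(1, 6) ** len(prefix)

def prob_first_six_at(k):
    """P{first 6 occurs on toss k}: there are 5**(k-1) admissible prefixes
    of length k (no 6 in the first k-1 places, a 6 in place k)."""
    return 5 ** (k - 1) * cylinder_measure([1] * k)

def prob_six_within(n):
    """P{a 6 occurs somewhere in the first n tosses}."""
    return sum(prob_first_six_at(k) for k in range(1, n + 1))

for n in (1, 10, 50):
    # Strictly less than 1 for every n: no bound fixed in advance suffices.
    print(n, float(prob_six_within(n)))
```

Since `prob_six_within(n)` equals $1 - (5/6)^n < 1$ for every $n$, no finite number of tosses fixed in advance carries the whole analysis.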

If $C$ represents conditions on points of $\Omega$, the notation $\{C\}$ will be used
for the set of points satisfying those conditions. For example, if $X$ is a
linear set, and if $x$ is an $\omega$ function, $\{x(\omega) \in X\}$ is the $\omega$ set on which
$x(\omega)$ is a number in the set $X$.

We have not yet assumed that our basic probability measure is complete
(see Supplement). That is, if $A_0$ is measurable, and has measure 0, and
if $A$ is a subset of $A_0$, then $A$ need not be measurable. In the language
of probability this means that there may be two events, $A$ and $A_0$, with
the properties that the occurrence of $A$ implies that of $A_0$, and that

\[ P\{A_0\} = 0, \]

and yet $A$ is not assigned any probability. (Clearly if $P\{A\}$ is defined
it is 0.) This possibility may be somewhat disturbing to a mathematician
who wishes to preserve his intuitive concept of probability. One has the
choice here of either changing one’s intuition or one’s mathematics if one
insists on a close correspondence between the two. The choice is not
very important, but in any event the mathematics is easily modified for
the sake of those with stubborn intuitions. In fact (see Supplement §2)
the given probability measure can be completed in a unique way by a
slight enlargement of the class of $\omega$ sets whose measures are defined.
To avoid fussy details in II §2, we assume in this book that $P$ measure
is complete.


3. Random variables and probability distributions


A (real) function $x$, defined on a space of points $\omega$, will be called a
(real) random variable if there is a probability measure defined on $\omega$ sets,
and if for every real number $\lambda$ the inequality $x(\omega) \leq \lambda$ delimits an $\omega$ set
whose probability is defined, that is, a measurable $\omega$ set. Thus

(3.1)    \[ F(\lambda) = P\{x(\omega) \leq \lambda\} \]

is defined for all real $\lambda$. In mathematical language a (real) random
variable is thus simply a (real) measurable function. A complex random
variable is an $\omega$ function whose real and imaginary parts are measurable.

Throughout this book, whenever more than one random variable is
involved in a discussion, it will always be assumed unless the contrary
is explicitly stated that the random variables are all defined on the same
$\omega$ space.

Random variables were manipulated by probabilists long before it was
recognized that the mathematical concept involved was that of a
measurable function, and in fact long before measure theory was invented. Thus
probabilists developed a specialized language of their own, which it is
now possible to translate into the language of measure theory.
Probability language has not been dropped because it adds intuitive content to
the subject and also makes it more accessible to workers in applied fields.

It is customary in analysis to use the same notation for a function as
for its value at a given point of its domain of definition, and the ambiguity
has its compensations. It will be necessary to be somewhat more precise
when we deal with functions in this book. A function will usually be
denoted by a single letter, and the usual functional notation will be
reserved for the value of the function at a given point. Thus $x(\omega)$ is the
value of the function $x$ at the point $\omega$. As usual, it will sometimes be
convenient to denote the function $x$ by $x(\cdot)$.

The function $F$ defined by (3.1) is called the distribution function of the
random variable $x$. It is monotone non-decreasing, continuous on the
right, and

(3.2)    \[ \lim_{\lambda \to -\infty} F(\lambda) = 0, \qquad \lim_{\lambda \to \infty} F(\lambda) = 1. \]

Any function $F$ satisfying all these conditions will be called a distribution
function. A distribution function defines a distribution, that is, a
probability measure

\[ \int_A dF(\lambda) \]

of sets $A$. This is the usual Lebesgue-Stieltjes measure defined by $F$.
If $F$ is a distribution function, and if, for some Lebesgue measurable
and integrable function $f$,

(3.3)    \[ F(\lambda) = \int_{-\infty}^{\lambda} f(\mu)\,d\mu, \qquad -\infty < \lambda < \infty, \]

$f$ is called the density function corresponding to $F$. We have $F'(\lambda) = f(\lambda)$
for almost all $\lambda$ (Lebesgue measure). The statement that $F$ has a density
will be understood to imply the existence of $f$ satisfying (3.3), that is,
there is a density if and only if $F$ is absolutely continuous.
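
As a small illustration of (3.1) and (3.2), the following Python sketch evaluates the distribution function of the number shown by a fair die. The choice of this random variable and the function name are assumptions made for the example; this $F$ is a step function and therefore has no density.

```python
from fractions import Fraction

def die_distribution_function(lam):
    """F(lam) = P{x(omega) <= lam}, where x is the number shown by a fair die."""
    return Fraction(sum(1 for j in range(1, 7) if j <= lam), 6)

F = die_distribution_function
print(F(0), F(100))         # the limiting values 0 and 1 of (3.2)
print(F(2.999), F(3))       # a jump of 1/6 at lam = 3
print(F(3), F(3.001))       # no jump from the right: continuity on the right
```

The use of the inequality $\leq$ in (3.1) is what makes the value at a jump agree with the limit from the right.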

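If $x_1, \dots, x_n$ are real random variables, the function defined by

(3.4)    \[ F(\lambda_1, \dots, \lambda_n) = P\{x_j(\omega) \leq \lambda_j,\ j = 1, \dots, n\} \]

is called their multivariate distribution function. The function $F$ defined
by (3.4) is monotone non-decreasing and continuous on the right in each
variable, with

\[ \lim_{\lambda_j \to -\infty} F(\lambda_1, \dots, \lambda_n) = 0, \qquad j = 1, \dots, n, \]

\[ \lim_{\lambda_1, \dots, \lambda_n \to \infty} F(\lambda_1, \dots, \lambda_n) = 1. \]

Moreover if $\lambda_j < \mu_j$, $j = 1, \dots, n$, then

\[ F(\mu_1, \dots, \mu_n) - \sum_{j=1}^{n} F(\mu_1, \dots, \mu_{j-1}, \lambda_j, \mu_{j+1}, \dots, \mu_n) \]
\[ {} + \sum_{j<k} F(\mu_1, \dots, \mu_{j-1}, \lambda_j, \mu_{j+1}, \dots, \mu_{k-1}, \lambda_k, \mu_{k+1}, \dots, \mu_n) \]
\[ {} - \cdots + (-1)^n F(\lambda_1, \dots, \lambda_n) \geq 0. \]

The quantity on the left is the evaluation in terms of $F$ of

\[ P\{\lambda_j < x_j(\omega) \leq \mu_j,\ j = 1, \dots, n\}. \]

Any function $F$ satisfying all these conditions will be called an $n$-variate
distribution function. Such a function defines a probability measure

\[ \int \cdots \int_A dF(\lambda_1, \dots, \lambda_n), \]

Lebesgue-Stieltjes measure in $n$ dimensions. (See Supplement Example
2.2.) If, for some Lebesgue measurable and integrable $f$,

(3.5)    \[ F(\lambda_1, \dots, \lambda_n) = \int_{-\infty}^{\lambda_1} \cdots \int_{-\infty}^{\lambda_n} f(\mu_1, \dots, \mu_n)\,d\mu_1 \cdots d\mu_n \]

for all $\lambda_1, \dots, \lambda_n$, then $f$ is called the corresponding density function.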
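
The alternating sum over the corners of a rectangle can be checked numerically. The sketch below is an illustration under assumptions of its own: the pair $(x_1, x_2)$ is taken to be the first of two die tosses and the sum of both, and the names `pmf`, `F`, `rectangle_probability` and `direct_probability` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Joint distribution of (x1, x2): x1 = first toss, x2 = sum of two tosses.
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    key = (a, a + b)
    pmf[key] = pmf.get(key, 0) + Fraction(1, 36)

def F(*lams):
    """Multivariate distribution function P{x_j <= lam_j for all j}."""
    return sum(p for point, p in pmf.items()
               if all(v <= lam for v, lam in zip(point, lams)))

def rectangle_probability(lower, upper):
    """Signed sum of F over the 2**n corners of the rectangle; a corner
    with r coordinates taken from `lower` enters with sign (-1)**r."""
    total = 0
    for choice in product((0, 1), repeat=len(lower)):
        corner = [lower[j] if c else upper[j] for j, c in enumerate(choice)]
        total += (-1) ** sum(choice) * F(*corner)
    return total

def direct_probability(lower, upper):
    """P{lower_j < x_j <= upper_j for all j}, computed point by point."""
    return sum(p for point, p in pmf.items()
               if all(lo < v <= hi for v, lo, hi in zip(point, lower, upper)))

print(rectangle_probability((1, 4), (3, 8)))
print(direct_probability((1, 4), (3, 8)))
```

The two evaluations agree, and the non-negativity required of an $n$-variate distribution function is simply the statement that this probability cannot be negative.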
The real random variables $x_1, \dots, x_n$ are said to be mutually
independent if

(3.6)    \[ P\{x_j(\omega) \in X_j,\ j = 1, \dots, n\} = \prod_{j=1}^{n} P\{x_j(\omega) \in X_j\} \]

for all linear Borel sets $X_1, \dots, X_n$. Equivalently one can require (3.6)
only for open sets $X_j$, or intervals, or even semi-infinite intervals
$(-\infty, b_j]$. It follows that the random variables are mutually independent
if and only if their multivariate distribution function is the product of
their individual distribution functions.
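
The product criterion can be tried on a finite space. The following Python sketch takes the 36 equally likely results of two tosses as the basic space (an assumption of the example) and tests whether the bivariate distribution function factors. Because the distribution functions here are step functions, it is enough to test at the values the variables actually take.

```python
from fractions import Fraction
from itertools import product

# Basic space: the 36 equally likely results of two tosses of a die.
TOSSES = list(product(range(1, 7), repeat=2))

def prob(condition):
    """Probability of the set of points satisfying `condition`."""
    return Fraction(sum(1 for w in TOSSES if condition(w)), len(TOSSES))

def is_independent(x, y):
    """True if P{x <= a, y <= b} = P{x <= a} P{y <= b} at all attained a, b."""
    xs = sorted({x(w) for w in TOSSES})
    ys = sorted({y(w) for w in TOSSES})
    return all(
        prob(lambda w: x(w) <= a and y(w) <= b)
        == prob(lambda w: x(w) <= a) * prob(lambda w: y(w) <= b)
        for a in xs for b in ys
    )

def first(w):
    return w[0]

def second(w):
    return w[1]

def total(w):
    return w[0] + w[1]

print(is_independent(first, second))  # the two tosses
print(is_independent(first, total))   # first toss and the sum
```

The first pair factors and the second does not, since the sum carries information about the first toss.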

If $x_j$ in the preceding paragraph represents the $m_j$ variables
$x_{j1}, \dots, x_{jm_j}$, and if $X_j$ is correspondingly an $m_j$-dimensional set,
the preceding paragraph furnishes the definition of mutual independence
of finitely many aggregates each containing finitely many random variables.
If the aggregates may contain infinitely many random variables, the aggregates
are said to be mutually independent if whenever each infinite aggregate is
replaced by a finite subaggregate these subaggregates are mutually
independent. Infinitely many aggregates of random variables are said to be
mutually independent if, whenever $A_1, A_2, \dots$ are finitely many of the
given aggregates, $A_1, A_2, \dots$ are mutually independent.

The preceding definitions are applicable to complex random variables
if we allow the $X_j$’s in (3.6) to be two-dimensional Borel sets, that is,
Borel sets of the complex plane, making corresponding conventions in the
later discussion. Or the preceding definitions are applicable directly, if
the convention is made that a complex random variable is to be considered
a pair of real random variables, its real and imaginary parts.

Let $x$ be a random variable. Then its integral over $\omega$ space $\Omega$, if the
integral exists, will be denoted by $E\{x\}$, or $E\{x(\omega)\}$, and called the
expectation of $x$,

\[ E\{x\} = \int_\Omega x\,dP. \]

We recall that this integral exists if and only if the integral of $|x|$ is finite.


4. Convergence concepts

Let $x, x_1, x_2, \dots$ be random variables. If

\[ \lim_{n \to \infty} x_n(\omega) = x(\omega) \]

for almost all $\omega$, we say that

\[ \lim_{n \to \infty} x_n = x \]

with probability 1. There is such convergence if and only if

(4.1)    \[ \lim_{n \to \infty} P\{\operatorname*{L.U.B.}_{m \geq n} |x_m(\omega) - x(\omega)| > \varepsilon\} = 0 \]

for every $\varepsilon > 0$.

The $x_n$ sequence is said to converge stochastically to $x$ if the weaker
condition

(4.2)    \[ \lim_{n \to \infty} P\{|x_n(\omega) - x(\omega)| > \varepsilon\} = 0 \]

is satisfied for every $\varepsilon > 0$. Stochastic convergence, denoted by

(4.3)    \[ \operatorname*{p\,lim}_{n \to \infty} x_n = x, \]

is also known as convergence in probability and as convergence in measure.
The relations between convergence with probability 1 and convergence in
probability are well known, and will not be proved here:
(a) convergence with probability 1 implies convergence in probability;
(b) $\operatorname{p\,lim} x_n = x$ if and only if every subsequence $\{x_{a_n}\}$ of the $x_n$’s
contains a further subsequence which converges to $x$ with probability 1.
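
A standard example, not taken from the text, separates the two kinds of convergence. Take $\omega$ space to be the interval $[0, 1)$ with Lebesgue measure, and let $x_n$ be the indicator of the interval $[k/2^m, (k+1)/2^m)$, where $n = 2^m + k$ with $0 \leq k < 2^m$. The Python sketch below (all names are assumptions) shows that $P\{|x_n| > \varepsilon\}$ tends to 0, while at any fixed $\omega$ the value 1 recurs once in every block of indices, so there is no convergence with probability 1.

```python
from fractions import Fraction

def block(n):
    """Write n = 2**m + k with 0 <= k < 2**m and return (m, k)."""
    m = n.bit_length() - 1
    return m, n - 2 ** m

def x(n, omega):
    """x_n(omega): indicator of [k / 2**m, (k + 1) / 2**m), omega in [0, 1)."""
    m, k = block(n)
    return 1 if Fraction(k, 2 ** m) <= omega < Fraction(k + 1, 2 ** m) else 0

def prob_nonzero(n):
    """P{|x_n| > eps} for 0 < eps < 1: the length of the interval."""
    m, _ = block(n)
    return Fraction(1, 2 ** m)

omega = Fraction(1, 3)
print([prob_nonzero(2 ** m) for m in range(6)])  # tends to 0
# In each block 2**m <= n < 2**(m+1) exactly one x_n(omega) equals 1.
print([sum(x(n, omega) for n in range(2 ** m, 2 ** (m + 1))) for m in range(6)])
# The subsequence x_1, x_2, x_4, x_8, ... is eventually 0 at this omega.
print([x(2 ** m, omega) for m in range(6)])
```

The last line is in the spirit of (b): the subsequence with indices $2^m$ converges to 0 at every $\omega > 0$, hence with probability 1.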
The sequence $\{x_n\}$ is said to converge to $x$ in the mean if $E\{|x_n|^2\} < \infty$
for all $n$, if $E\{|x|^2\} < \infty$, and if

(4.4)    \[ \lim_{n \to \infty} E\{|x - x_n|^2\} = 0. \]

This is written

(4.5)    \[ \operatorname*{l.i.m.}_{n \to \infty} x_n = x. \]

Convergence in the mean implies convergence in probability. In fact, if
(4.5) is true,

(4.6)    \[ \lim_{n \to \infty} P\{|x_n(\omega) - x(\omega)| > \varepsilon\} \leq \lim_{n \to \infty} \frac{E\{|x_n - x|^2\}}{\varepsilon^2} = 0 \]

for all $\varepsilon > 0$.
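
The inequality behind (4.6) can be verified exactly on a finite space. In the sketch below the basic space is that of a single die toss, $x = 0$, and $x_n(\omega) = \omega / n$; these choices and the function names are assumptions made only for illustration.

```python
from fractions import Fraction

OMEGA = range(1, 7)       # one toss of a die
P = Fraction(1, 6)        # probability of each point

def x_n(n, w):
    """The n-th random variable; the limit x is identically 0."""
    return Fraction(w, n)

def prob_deviation(n, eps):
    """P{|x_n - x| > eps}."""
    return sum(P for w in OMEGA if abs(x_n(n, w)) > eps)

def mean_square(n):
    """E{|x_n - x|^2}."""
    return sum(P * x_n(n, w) ** 2 for w in OMEGA)

eps = Fraction(1, 2)
for n in (1, 5, 10, 20):
    bound = mean_square(n) / eps ** 2
    print(n, prob_deviation(n, eps), bound, prob_deviation(n, eps) <= bound)
```

Both sides tend to 0 here, the right side like $1/n^2$, which is the mechanism by which convergence in the mean forces convergence in probability.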


Let $\{F_n\}$ be a sequence of one-dimensional distribution functions. The
most appropriate definition of convergence to a distribution function $F$
is the condition that $\lim_{n \to \infty} F_n(\lambda) = F(\lambda)$ at every point of continuity of $F$.
If this condition is satisfied, the convergence will necessarily be uniform
in any closed interval of continuity of $F$, finite or infinite. The definition
of distance between two distribution functions $G_1$ and $G_2$ that matches
this convergence is the following. The graphs of $G_1$ and $G_2$ are completed
with vertical line segments at points of discontinuity, and the distance
between $G_1$ and $G_2$ is defined as the maximum distance between points
of the two completed graphs measured along lines of slope $-1$. Under
this definition the space of distribution functions is a complete metric
space and $\lim_{n \to \infty} F_n(\lambda) = F(\lambda)$ at all points of continuity of $F$ if and only
if the distance between $F_n$ and $F$ goes to 0 with $1/n$.

If the sequence $\{F_n\}$ is the sequence of distribution functions of the
sequence of random variables $\{x_n\}$, and if the distribution functions
converge to a distribution function in the sense just described, the sequence
of random variables is sometimes said to converge in distribution. This
terminology is rather unfortunate, since the sequence of random variables
may not then converge in any ordinary sense. For example, whenever
the distribution functions are identical and the random variables otherwise
arbitrary, say mutually independent, the sequence of random variables
converges in distribution.
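
A deliberately simple illustration in Python: on the space of one die toss, let $x_n(\omega) = \omega$ for even $n$ and $x_n(\omega) = 7 - \omega$ for odd $n$. This particular sequence is an assumption of the sketch (its members are not mutually independent), but it has identical distribution functions and yet converges at no point.

```python
from fractions import Fraction

OMEGA = range(1, 7)  # one toss of a die, each point with probability 1/6

def x(n, w):
    """x_n alternates between the face shown and the opposite value 7 - w."""
    return w if n % 2 == 0 else 7 - w

def distribution_function(n, lam):
    """F_n(lam) = P{x_n <= lam}."""
    return Fraction(sum(1 for w in OMEGA if x(n, w) <= lam), 6)

# Every F_n is the same function, so the sequence converges in distribution.
print(all(distribution_function(n, lam) == distribution_function(0, lam)
          for n in range(6) for lam in range(0, 8)))
# Successive terms differ by |2w - 7| >= 1 at every point w.
print([abs(x(0, w) - x(1, w)) for w in OMEGA])
```

The sequence therefore converges in distribution while failing to converge with probability 1, in probability, or in the mean.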


5. Families of random variables 

In some discussions, random variables are defined directly. For
example, if $\omega$ space is the line, if the measurable $\omega$ sets are the Lebesgue
measurable sets, and if probability measure is defined by


1 
5.1 P{4} = = |a 
6.1) E r 
A 
then the sequence of functions $1, \omega, \omega^2, \dots$ is a sequence of random
variables. However, in many discussions the specific $\omega$ space involved
is irrelevant, and only the existence of a family of random variables with
specified properties is required. In this case, the common situation is
that the multivariate distributions of finite aggregates of the random
variables are specified, and it is understood that there exists a family of
random variables with the specified distributions. The present section is
devoted to a discussion of this point. In II §2 the possibility of prescribing
further distributions will be discussed.

For example, when a theorem has the initial phrase “let $x_1, \cdots, x_n$ be
mutually independent random variables with distribution functions
$F_1, \cdots, F_n$,” the theorem would be trivial without the existence theorem
that there are $n$ random variables with the stated properties defined on
some space. It will clarify the following general discussion if we outline
here a proof of this existence theorem. Let $\omega$ space be the $n$-dimensional
space of points $\omega$: $(\xi_1, \cdots, \xi_n)$. A probability measure is determined
on the Borel sets of $\omega$ space by

$$P\{\Lambda\} = \int \cdots \int_{\Lambda} dF_1(\xi_1) \cdots dF_n(\xi_n)$$

(see §3). Let $x_j$ be the $j$th coordinate variable, that is, $x_j(\omega) = \xi_j$ if $\omega$ is
the point $(\xi_1, \cdots, \xi_n)$. The $x_j$'s are then mutually independent random
variables with the respective distribution functions $F_1, \cdots, F_n$. We have
thus proved that there are random variables $x_1, \cdots, x_n$ with the desired
properties, and incidentally that the random variables can be taken as
the coordinate functions in $n$-dimensional space. The point of the
following paragraph is the generalization of the method used here to the
most general aggregates of random variables.
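
When the distribution functions are discrete, the construction just outlined can be carried out by direct enumeration. The Python sketch below is an added illustration, not part of the original argument. The marginals `F1` and `F2` and the helpers `product_space` and `prob` are hypothetical names chosen for the example.

```python
from fractions import Fraction
from itertools import product

def product_space(marginals):
    """Return {point: measure} for the product measure of discrete marginals.

    Each marginal is a dict {value: probability}. A point of the space is a
    tuple (xi_1, ..., xi_n); its measure is the product of the marginal
    masses, the discrete analogue of dF_1(xi_1) ... dF_n(xi_n).
    """
    space = {}
    for point in product(*[m.keys() for m in marginals]):
        mass = Fraction(1)
        for value, marginal in zip(point, marginals):
            mass *= marginal[value]
        space[point] = mass
    return space

def prob(space, event):
    """Measure of the set of points satisfying the predicate `event`."""
    return sum((mass for point, mass in space.items() if event(point)), Fraction(0))

# Two hypothetical marginal distributions.
F1 = {0: Fraction(1, 3), 1: Fraction(2, 3)}
F2 = {-1: Fraction(1, 2), 0: Fraction(1, 4), 2: Fraction(1, 4)}
space = product_space([F1, F2])

# The coordinate function x_1 has the prescribed distribution ...
marginal_x1 = prob(space, lambda w: w[0] == 1)
# ... and joint probabilities of the coordinate functions factor.
joint = prob(space, lambda w: w[0] == 1 and w[1] == 2)
factored = F1[1] * F2[2]
```

Here `marginal_x1` reproduces `F1[1]` and `joint` agrees with `factored`, which is the independence of the coordinate functions in this small case.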

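In general, a class of subscripts $T$ is given, and a class of random
variables $\{x_t, t \in T\}$ is to be defined. For every finite $t$ set $t_1, \cdots, t_n$
the multivariate distribution function $F_{t_1, \cdots, t_n}$ of $x_{t_1}, \cdots, x_{t_n}$ is
prescribed. It is obvious that the distribution functions prescribed must be
mutually consistent in the sense that if $\alpha_1, \cdots, \alpha_n$ is a permutation of
$1, \cdots, n$, then

$$F_{t_1, \cdots, t_n}(\lambda_1, \cdots, \lambda_n) = F_{t_{\alpha_1}, \cdots, t_{\alpha_n}}(\lambda_{\alpha_1}, \cdots, \lambda_{\alpha_n}),$$

and that, if $m < n$,

$$F_{t_1, \cdots, t_m}(\lambda_1, \cdots, \lambda_m) = \lim_{\lambda_j \to \infty,\ j = m+1, \cdots, n} F_{t_1, \cdots, t_n}(\lambda_1, \cdots, \lambda_n).$$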


Kolmogorov proved (see Supplement Example 2.3) that these consistency
conditions are the only conditions that need be imposed. The $x_t$'s are
defined as follows. The space $\Omega$ is taken as the space of points $\omega$:
$(\xi_t, t \in T)$, where $\xi_t$ is any real number, that is, $\Omega$ is the space of functions
of $t \in T$, or, from another point of view, $\Omega$ is coordinate space, whose
dimensionality is the cardinal number of $T$. The value of a $t$ function
at $t = s$ defines an $\omega$ function $x_s$, if we set $x_s(\omega) = \xi_s$, where $\omega$ is the
function $\xi_t$. The $\omega$ set

$$\{x_{t_j}(\omega) \leq \lambda_j,\ j = 1, \cdots, n\}$$

is assigned as measure the given number

$$F_{t_1, \cdots, t_n}(\lambda_1, \cdots, \lambda_n);$$

more generally the $\omega$ set

$$\{[x_{t_1}(\omega), \cdots, x_{t_n}(\omega)] \in A\}$$

is assigned as measure (if $A$ is an $n$-dimensional Borel set)

$$\int \cdots \int_{A} d_{\xi_1, \cdots, \xi_n} F_{t_1, \cdots, t_n}(\xi_1, \cdots, \xi_n).$$

It is proved that this assignment of measures to $\omega$ sets of the specified
type determines a probability measure on the Borel field of $\omega$ sets generated
by the class of sets of this type. The family of $\omega$ functions $\{x_t, t \in T\}$ is
then a family of random variables with the assigned distributions.
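
The two consistency conditions can be checked mechanically on a discrete family of finite-dimensional distributions. The sketch below is an added illustration. It generates such a family from a hypothetical two-state Markov chain (`init`, `step`), so the conditions hold by construction and the code only shows what they assert. In the discrete case the limit $\lambda_j \to \infty$ becomes a sum over the extra coordinate.

```python
from fractions import Fraction
from itertools import product

# Hypothetical two-state chain, used only to produce a consistent family.
init = {0: Fraction(1, 2), 1: Fraction(1, 2)}
step = {0: {0: Fraction(3, 4), 1: Fraction(1, 4)},
        1: {0: Fraction(1, 3), 1: Fraction(2, 3)}}

def path_mass(path):
    """Probability of the path (x_0, x_1, ..., x_k)."""
    mass = init[path[0]]
    for a, b in zip(path, path[1:]):
        mass *= step[a][b]
    return mass

def fdd(times):
    """Joint mass function of (x_t for t in times); times may be in any order."""
    horizon = max(times) + 1
    out = {}
    for path in product((0, 1), repeat=horizon):
        key = tuple(path[t] for t in times)
        out[key] = out.get(key, Fraction(0)) + path_mass(path)
    return out

f02, f20 = fdd((0, 2)), fdd((2, 0))
f01, f012 = fdd((0, 1)), fdd((0, 1, 2))

# Permutation condition: reordering the times reorders the arguments.
perm_ok = all(f20[(b, a)] == f02[(a, b)] for a in (0, 1) for b in (0, 1))

# Marginal condition: summing out the last coordinate recovers the
# lower-dimensional distribution.
marg_ok = all(f01[(a, b)] == sum(f012[(a, b, c)] for c in (0, 1))
              for a in (0, 1) for b in (0, 1))
```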

The basic $\omega$ space $\Omega$ was defined in this discussion to be the space $\Omega_T$
of all functions of $t \in T$, so that the family of random variables obtained
was the family of coordinate functions of a Cartesian space. Even if the
basic space of a family of random variables $\{x_t, t \in T\}$ is not this Cartesian
space, however, we shall see in §6 that the family can be replaced by such
a family, for many purposes.

Let $\{x_t, t \in T\}$ be a family of random variables. Then for fixed $\omega$ the
function value $x_t(\omega)$ defines a function of $t$. The $t$ functions defined in
this way will be called the sample functions of the family. In particular
if $\Omega$ is coordinate space $\Omega_T$, and if $x_t$ is the $t$th coordinate function, the
concepts of basic point and sample function coincide. In any case it
is frequently convenient to use such phrases as “almost all sample
functions,” involving measure concepts and sample functions, with the
convention that the measure concepts are referred back to $\Omega$. For example,
the phrase “almost all sample functions” means “almost all $\omega$.” If the
parameter set $T$ is finite or enumerably infinite, it will usually be more
appropriate to speak of a sample sequence than of a sample function, and
we shall do so.

Finally we remark that, when a family of distribution functions is used
as at the beginning of this section to define a measure of $\Omega_T$ sets, it may
sometimes be useful to take $\Omega_T$ not as the space of all functions of $t$, $t \in T$,
but as this space with the restriction that the values of the function lie in
a given set $X$. (We shall not find it necessary in this book to have $X$
depend on the parameter value $t$.) All the considerations of the above
work remain valid if $X$ is a Borel set of the infinite line $-\infty < x < \infty$
and if the given distributions restrict each of the random variables to $X$,
with probability 1.

To illustrate the preceding discussion we apply it to the analysis of the
repeated tossing of a die. Suppose that $n$ tosses are to be considered.
The parameter set $T$ will then be taken as the set of integers $1, \cdots, n$.
The space $\Omega_T$ is the $n$-dimensional space of points $(\xi_1, \cdots, \xi_n)$. Any
point whose coordinates are all integers between 1 and 6 inclusive is given
measure $1/6^n$, and the measure of any set of points is the number of these
special points in the set, divided by $6^n$. In this way we obtain a
mathematical model of $n$ tosses of a (balanced) die (see also the discussion in
§2), and the $j$th coordinate of a point of $\Omega_T$ is a random variable,
corresponding to the number obtained as a result of the $j$th toss. It is clear
that $\Omega_T$ has unnecessarily many points in it. The point $(0, \cdots, 0)$, for
example, corresponds to no experimental result and simply clutters up
the mathematical model. It is for this reason that the set $X$ was
introduced in the preceding paragraph. If we take $X$ as the set of integers
1, 2, 3, 4, 5, 6, the $\omega$ space becomes the space of $6^n$ points in $n$ dimensions
whose coordinates are all integers between 1 and 6 inclusive. The
measure assigned to any set is the number of its points divided by $6^n$.
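
The reduced model with $X = \{1, \cdots, 6\}$ is small enough to enumerate. The following Python sketch is an added illustration with hypothetical helper names (`die_space`, `measure`). It builds the $6^n$ points for $n = 3$ and evaluates two events by counting.

```python
from fractions import Fraction
from itertools import product

def die_space(n):
    """All 6**n points (xi_1, ..., xi_n) with integer coordinates in 1..6."""
    return list(product(range(1, 7), repeat=n))

def measure(points, event):
    """Number of points in the event divided by the total number of points."""
    favourable = sum(1 for w in points if event(w))
    return Fraction(favourable, len(points))

n = 3
omega = die_space(n)

# x_j is the jth coordinate function; here the second toss shows a six.
p_second_is_six = measure(omega, lambda w: w[1] == 6)

# An event depending on all three coordinates: the total is 10.
p_sum_is_ten = measure(omega, lambda w: sum(w) == 10)
```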


6. Product space representations 


Let $x$ be a real random variable, with distribution function $F$. Then
(see Supplement Example 2.1) $F$ determines a Lebesgue-Stieltjes measure
of linear sets defined by

$$\tilde{P}\{A\} = \int_{A} dF(\xi).$$

The measurable linear sets are the sets measurable with respect to $F$.
These sets include all the Borel sets. In terms of our probability measure
on the line, we can interpret an event determined by a condition on $x$,
say $x(\omega) \in A$, as the linear point set $A$ rather than the $\omega$ set $\{x(\omega) \in A\}$,
and the linear measure has been defined to make the probability the same
for the two interpretations. The formal elaboration of this idea is to set
up a measure preserving transformation from $\omega$ space to the line. This
is done in detail in the Supplement (see especially Example 3.1), but the
ideas in this section should be clear even without that detailed discussion.
Let $\tilde{x}$ be the coordinate random variable on the line. That is, $\tilde{x}$ is the
function which takes on the value $\xi$ at the point with coordinate $\xi$.
Suppose that in some investigation one is only concerned with random
variables of the form $\Phi(x)$, where $\Phi$ is a Baire function of a real variable,
or more generally is measurable with respect to $F$. Then it is possible,
and frequently convenient, to replace the original basic space by the
line, using the $F$ measure defined above. For example, an $\omega$ random
variable $\Phi(x)$ then becomes a random variable $\Phi(\tilde{x})$ defined on the line,
and these two random variables, defined on different spaces, have the
same distribution. More generally, if $\Phi_1(x), \cdots, \Phi_n(x)$ are $\omega$ random
variables of this type, they become random variables on the line, and the
multivariate distributions of the two sets of random variables, defined on
different spaces, are the same. As one application of this idea, consider
the expectation of $\Phi(x)$, where $\Phi$ is as above. From the point of view
of $\omega$ space, $E\{\Phi(x)\}$ is defined by


$$\int_{\Omega} \Phi[x(\omega)]\, dP.$$


From the point of view of the new basic space $\tilde{\Omega}$, the $\xi$ axis, the random
variable in question is $\Phi(\tilde{x})$, and its expectation is defined by

$$\int_{-\infty}^{\infty} \Phi(\xi)\, dF(\xi).$$


(The equality of these two evaluations is proved in the Supplement, §3.) 
Thus it is unnecessary to revert to the original basic space to calculate the 
expectation of the random variable. 
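
For a random variable with finitely many values the two evaluations are finite sums, and their equality can be seen directly. The sketch below is an added illustration. The six-point space `omega`, the variable `x` and the function `phi` are hypothetical.

```python
from fractions import Fraction

# Hypothetical basic space: six equally likely points.
omega = ["a", "b", "c", "d", "e", "f"]
P = {w: Fraction(1, 6) for w in omega}

# A random variable x on that space and a function Phi of a real variable.
x = {"a": -1, "b": -1, "c": 0, "d": 2, "e": 2, "f": 2}

def phi(xi):
    return xi * xi + 1

# Point of view of omega space: the integral of Phi[x(w)] with respect to P.
e_omega = sum(phi(x[w]) * P[w] for w in omega)

# Measure induced on the line: the mass F places at each value xi of x.
dF = {}
for w in omega:
    dF[x[w]] = dF.get(x[w], Fraction(0)) + P[w]

# Point of view of the line: the integral of Phi(xi) with respect to F.
e_line = sum(phi(xi) * mass for xi, mass in dF.items())
```

Once `dF` is known, `e_line` is computed without any reference to `omega`. In the finite case this is the sense in which the original basic space can be replaced by the line.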

Let $x$, $y$ be real random variables with bivariate distribution function
$F$. Then, just as in the one-dimensional case, $F$ determines a probability
measure of plane sets,

$$\tilde{P}\{A\} = \iint_{A} d_{\xi, \eta} F(\xi, \eta)$$

for $A$ measurable with respect to $F$. Suppose that in some investigation
one is only concerned with random variables $\Phi(x, y)$, where $\Phi$ is a function
of two real variables, measurable with respect to $F$. Then, generalizing
the one-dimensional case discussed in the preceding paragraph, it is
frequently convenient to replace the original basic space $\Omega$ by $\tilde{\Omega}$, the
$\xi$, $\eta$ plane. The point is that the original $\omega$ probabilities induce a
probability measure on the space $\tilde{\Omega}$, by a point transformation which is measure
preserving, and that for many purposes the new space is more convenient
than the old one. As in the preceding paragraph, $E\{\Phi(x, y)\}$, defined in
terms of $\omega$ space and measure by

$$\int_{\Omega} \Phi[x(\omega), y(\omega)]\, dP,$$

is defined in terms of $\tilde{\omega}$ space and measure by

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \Phi(\xi, \eta)\, d_{\xi, \eta} F(\xi, \eta),$$

and the two evaluations are equal.




In general, let $\{x_t, t \in T\}$ be any family of random variables, and let
$\mathcal{B}(x_t, t \in T)$ be the smallest Borel field of $\omega$ sets with respect to which the
$x_t$'s are measurable. That is, $\mathcal{B}(x_t, t \in T)$ is the Borel field of sets generated
by the class of $\omega$ sets of the form $\{x_t(\omega) \in X\}$, where $t \in T$ and $X$ is an
interval. Then it is frequently convenient to replace the original basic
space $\Omega$ by $\tilde{\Omega} = \Omega_T$, the space of functions of $t \in T$. It is shown in the
Supplement, §3, that a probability measure can be defined on function
space, as described in §5, in such a way that, if $\{\tilde{x}_t, t \in T\}$ is the class of
coordinate variables in this space, every finite set of $\tilde{x}_t$'s has the same
multivariate distribution as the set of corresponding $x_t$'s. In fact there
is a transformation taking the $\omega$ random variables measurable with respect
to $\mathcal{B}(x_t, t \in T)$ into $\tilde{\omega}$ random variables, in a one to one way (if we consider
as identical random variables which are equal with probability 1), and
satisfying the following conditions:


(i) the transformation takes $x_t$ into $\tilde{x}_t$, and Baire functions of a finite
number of $x_t$'s into the same Baire functions of the corresponding $\tilde{x}_t$'s;


(ii) if $x$ is an $\omega$ random variable going into the $\tilde{\omega}$ random variable $\tilde{x}$,
then if the expectation of either random variable exists, that of the other
exists also, and the two expectations are equal;


(iii) if $x$ is an $\omega$ random variable, taking on the value 1 on a set
$\Lambda \in \mathcal{B}(x_t, t \in T)$, and 0 on the complement of $\Lambda$, then the transformation
takes $x$ into an $\tilde{\omega}$ random variable $\tilde{x}$ taking on the value 1 on a measurable
$\tilde{\omega}$ set $\tilde{\Lambda}$, and 0 on the complement of this set; the set transformation
defined in this way is one to one (if we consider as identical sets which
differ by sets of measure 0) and measure preserving.


Thus any problem involving $\omega$ random variables measurable with respect
to $\mathcal{B}(x_t, t \in T)$, or sets of this Borel field, can be expressed as the
corresponding problem in terms of $\tilde{\omega}$ random variables. The class of problems
that can be considered can be slightly enlarged by completing the
probability measure defined on the sets of $\mathcal{B}(x_t, t \in T)$. We refer to the
Supplement for a rigorous treatment of the mapping involved here. The
particular cases in which $T$ consists of only a single point, or of exactly
two points, have been discussed separately at the beginning of this section.
Problems involving $n$ random variables can be reduced to problems
involving $n$-dimensional coordinate space, and problems involving
infinitely many random variables to problems involving infinite dimensional
coordinate space. In each case the original random variables
become the coordinate functions of the new space.

As an example of the application of this idea when $T$ consists of
two points, consider the theorem that, if $x$ and $y$ are mutually
independent random variables, and have expectations, then $E\{xy\}$ exists
also, and

$$E\{xy\} = E\{x\}E\{y\}. \tag{6.1}$$


At first sight this does not appear to be a standard integration theorem,
and in fact it is sometimes treated as a very special theorem of the theory
of probability. But note that it is a theorem concerned only with the
two random variables $x$ and $y$, so that a representation on the plane is
admissible. In this representation we shall see that the theorem becomes
a standard integration theorem. If $x$, $y$ have distribution functions $G$, $H$
respectively, and if $F$ is the distribution function of the pair, then

$$F(\xi, \eta) = G(\xi)H(\eta),$$

because the random variables are independent, and (6.1) becomes, in
terms of the plane representation,

$$\iint \xi\eta\, dG(\xi)\, dH(\eta) = \iint \xi\, dG(\xi)\, dH(\eta) \iint \eta\, dG(\xi)\, dH(\eta). \tag{6.2}$$

The double integrals on the right reduce to single integrals, so that (6.2)
becomes

$$\iint \xi\eta\, dG(\xi)\, dH(\eta) = \int \xi\, dG(\xi) \int \eta\, dH(\eta). \tag{6.3}$$

Thus (6.1) becomes the evaluation of a double integral by iterated
integration. Any direct proof of (6.1) must be equivalent to a justification of
this standard evaluation.
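
In the discrete case the integrals in (6.3) are sums, and the identity can be verified term by term. The sketch below is an added illustration. The marginal masses `dG` and `dH` are hypothetical.

```python
from fractions import Fraction

# Hypothetical marginal masses of two independent variables x and y.
dG = {1: Fraction(1, 2), 3: Fraction(1, 2)}
dH = {0: Fraction(1, 4), 2: Fraction(1, 4), 4: Fraction(1, 2)}

# Plane representation: independence means the joint mass at (xi, eta)
# is dG(xi) * dH(eta).
joint = {(xi, eta): g * h for xi, g in dG.items() for eta, h in dH.items()}

# Left side of (6.3): the double sum of xi * eta against the joint mass.
lhs = sum(xi * eta * m for (xi, eta), m in joint.items())

# Right side of (6.3): the product of the two single sums.
e_x = sum(xi * g for xi, g in dG.items())
e_y = sum(eta * h for eta, h in dH.items())
rhs = e_x * e_y
```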

In the preceding discussion we have assumed that the random variables
were real. The extension of representation theory to complex valued
random variables is obvious, and the details will be omitted.


7. Conditional probabilities and expectations 

Let $y$ be a random variable and let $M$ be a measurable $\omega$ set. We wish
to define the conditional probability of $M$, and the conditional expectation
of $y$, relative to various specific conditions. Before doing so we consider
two special cases.

Case 1 Suppose that a random variable $x$ takes on only a finite or
enumerable sequence of values $a_1, a_2, \cdots$. The conditional probability
of $M$ if $x(\omega) = a_j$, which we denote by $P\{M \mid x(\omega) = a_j\}$, is defined,
whenever $P\{x(\omega) = a_j\} > 0$, by

$$P\{M \mid x(\omega) = a_j\} = \frac{P\{\omega \in M,\ x(\omega) = a_j\}}{P\{x(\omega) = a_j\}}. \tag{7.1}$$


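In particular if $y$ takes on only the values $b_1, b_2, \cdots$, we obtain in this
way the conditional distribution of $y$ for $x(\omega) = a_j$,

$$P\{y(\omega) = b_k \mid x(\omega) = a_j\} = \frac{P\{y(\omega) = b_k,\ x(\omega) = a_j\}}{P\{x(\omega) = a_j\}}. \tag{7.2}$$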


The conditional probability $P\{M \mid x(\omega) = a_j\}$ depends on $a_j$, that is, it
defines a function of the values taken on by the random variable $x$. If
we substitute $x(\omega)$ for its value in the definition of this conditional
probability, we obtain a random variable $z$, defined by

$$z(\omega) = P\{M \mid x(\omega) = a_j\} \quad \text{where } x(\omega) = a_j,$$

if $P\{x(\omega) = a_j\} > 0$. We define $z(\omega)$ arbitrarily on the $\omega$ set $\{x(\omega) = a_j\}$
if this set has probability 0. The random variable $z$ is thus defined
uniquely, neglecting values on an $\omega$ set of probability 0. The conditional
probability of $M$ relative to $x$, denoted by $P\{M \mid x\}$, is defined as the
random variable $z$, or more precisely as any one of the versions of $z$.
Then


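$$P\{M \mid x\}\big|_{x(\omega) = a_j} = P\{M \mid x(\omega) = a_j\} \tag{7.3}$$

if $P\{x(\omega) = a_j\} > 0$. Let $A$ be any set of $a_j$'s, and define

$$\Lambda = \{x(\omega) \in A\} = \bigcup_{a_j \in A} \{x(\omega) = a_j\}.$$

Then we observe that

$$P\{\Lambda M\} = \sum_{a_j \in A} P\{M \mid x(\omega) = a_j\}\, P\{x(\omega) = a_j\}.$$

This equation can also be written in the form

$$P\{\Lambda M\} = \int_{\Lambda} P\{M \mid x\}\, dP. \tag{7.4}$$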


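Similarly the conditional expectation of $y$ relative to $x$, denoted by
$E\{y \mid x\}$, is defined as a random variable for which

$$E\{y \mid x\}\big|_{x(\omega) = a_j} = E\{y \mid x(\omega) = a_j\} = \sum_{k} b_k\, P\{y(\omega) = b_k \mid x(\omega) = a_j\}$$

if $P\{x(\omega) = a_j\} > 0$. The equation

$$\sum_{k} b_k\, P\{y(\omega) = b_k,\ x(\omega) \in A\} = \sum_{a_j \in A} E\{y \mid x(\omega) = a_j\}\, P\{x(\omega) = a_j\}$$

follows at once from our definitions. This equation can also be written
in the form

$$\int_{\Lambda} y\, dP = \int_{\Lambda} E\{y \mid x\}\, dP. \tag{7.5}$$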




Equations (7.4) and (7.5) inspire the general definitions of conditional 
probability and expectation to be given below. 
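
Equations (7.4) and (7.5) can be verified by direct summation on a small space. The sketch below is an added illustration. The four-point space `P`, the set `M` and the value set `A` are hypothetical, and every value of $x$ is assumed to have positive probability.

```python
from fractions import Fraction

# Hypothetical finite space: points are pairs (a, b) with x(w) = a, y(w) = b.
P = {
    (0, 1): Fraction(1, 8), (0, 3): Fraction(3, 8),
    (1, 1): Fraction(1, 4), (1, 5): Fraction(1, 4),
}

def x(w):
    return w[0]

def y(w):
    return w[1]

def prob(event):
    return sum((p for w, p in P.items() if event(w)), Fraction(0))

def cond_prob(M, a):
    """(7.1): P{M | x(w) = a}, defined when P{x(w) = a} > 0."""
    return prob(lambda w: M(w) and x(w) == a) / prob(lambda w: x(w) == a)

def cond_exp_y(a):
    """E{y | x(w) = a} as the sum over k of b_k P{y = b_k | x = a}."""
    values = {y(w) for w in P}
    return sum(b * cond_prob(lambda w, b=b: y(w) == b, a) for b in values)

def M(w):                         # an arbitrary measurable set
    return y(w) >= 3

A = {0, 1}                        # a set of values of x

def in_Lambda(w):                 # Lambda = {x(w) in A}
    return x(w) in A

# (7.4): P{Lambda M} equals the integral of P{M | x} over Lambda.
lhs_74 = prob(lambda w: in_Lambda(w) and M(w))
rhs_74 = sum(cond_prob(M, x(w)) * p for w, p in P.items() if in_Lambda(w))

# (7.5): the integrals of y and of E{y | x} over Lambda agree.
lhs_75 = sum(y(w) * p for w, p in P.items() if in_Lambda(w))
rhs_75 = sum(cond_exp_y(x(w)) * p for w, p in P.items() if in_Lambda(w))
```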

Case 2 Suppose that $\Omega$ is the $\xi$, $\eta$ plane, that the measurable $\omega$ sets
are the Lebesgue measurable sets, and that the given probability measure
is determined by a density function,

$$P\{\Lambda\} = \iint_{\Lambda} f(\xi, \eta)\, d\xi\, d\eta.$$

Then the abscissa and ordinate variables determine coordinate functions
$x$ and $y$ taking on the values $\xi$, $\eta$ respectively at the point $(\xi, \eta)$. These
functions are random variables whose joint distribution has density $f$.
We define a new density (in the variable $\eta$) by

$$\frac{f(\xi, \eta)}{\int_{-\infty}^{\infty} f(\xi, \eta)\, d\eta}$$

for each $\xi$ for which the denominator does not vanish. In analogy with
the discussion of Case 1, it appears natural to describe the distribution
obtained in this way as the conditional distribution of $y$ for $x(\omega) = \xi$,
and to describe as the conditional expectation of $y$ for $x(\omega) = \xi$ the ratio

$$\frac{\int_{-\infty}^{\infty} \eta f(\xi, \eta)\, d\eta}{\int_{-\infty}^{\infty} f(\xi, \eta)\, d\eta}.$$

(We make the assumption $E\{|y|\} < \infty$.) These descriptions will be
consistent with the general definitions to be given below.
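
The ratio above can be evaluated numerically for a concrete density. The sketch below is an added illustration. It assumes the hypothetical density $f(\xi, \eta) = \xi + \eta$ on the unit square, for which the ratio has the closed form $(\xi/2 + 1/3)/(\xi + 1/2)$, and uses a simple midpoint rule (`integrate`) rather than any particular library routine.

```python
def f(xi, eta):
    """Hypothetical joint density: xi + eta on the unit square, 0 elsewhere."""
    return xi + eta if 0.0 <= xi <= 1.0 and 0.0 <= eta <= 1.0 else 0.0

def integrate(g, lo, hi, steps=2000):
    """Midpoint rule for the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))

def cond_expectation(xi):
    """Ratio of the eta-integrals of eta*f(xi, eta) and f(xi, eta).

    The density vanishes outside 0 <= eta <= 1, so that range suffices.
    """
    denom = integrate(lambda eta: f(xi, eta), 0.0, 1.0)
    if denom == 0.0:
        raise ValueError("the denominator vanishes at this xi")
    numer = integrate(lambda eta: eta * f(xi, eta), 0.0, 1.0)
    return numer / denom

xi = 0.25
approx = cond_expectation(xi)
exact = (xi / 2 + 1 / 3) / (xi + 1 / 2)   # closed form for this density
```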

General case Instead of defining conditional probabilities and
expectations relative to a given random variable, we define slightly more general
concepts and particularize. It will be seen that a conditional probability
is a special case of a conditional expectation, and we therefore define the
latter first.

Let $y$ be a random variable whose expectation exists, and let $\mathcal{F}$ be a
Borel field of measurable $\omega$ sets. Let $\mathcal{F}' \supset \mathcal{F}$ be the Borel field of those
$\omega$ sets which are either $\mathcal{F}$ sets or which differ from $\mathcal{F}$ sets by sets of
probability 0. We recall (see Supplement, Theorem 2.3) that if a random
variable is measurable with respect to $\mathcal{F}'$ it is equal with probability 1 to
a random variable measurable with respect to $\mathcal{F}$. The conditional
expectation of $y$ relative to $\mathcal{F}$, denoted by $E\{y \mid \mathcal{F}\}$, is defined as any $\omega$
function which is measurable with respect to $\mathcal{F}'$, which is integrable, and
which satisfies the equation

$$\int_{\Lambda} E\{y \mid \mathcal{F}\}\, dP = \int_{\Lambda} y\, dP, \qquad \Lambda \in \mathcal{F}. \tag{7.6}$$

(It will necessarily also satisfy this equation for $\Lambda \in \mathcal{F}'$, because of the
relation between $\mathcal{F}$ and $\mathcal{F}'$. Thus the definitions of the conditional
expectations $E\{y \mid \mathcal{F}\}$ and $E\{y \mid \mathcal{F}'\}$ are identical.) Note that the right
side of (7.6) defines a function of $\Lambda \in \mathcal{F}$ which is completely additive and
which vanishes when $P\{\Lambda\} = 0$. Hence, according to the Radon-Nikodym
theorem (see Supplement, §2), this function of $\Lambda$ can be expressed
as the integral over $\Lambda$ of an $\omega$ function measurable with respect
to $\mathcal{F}$. This $\omega$ function is thus one possible version of $E\{y \mid \mathcal{F}\}$. However,
according to our definition, any $\omega$ function equal almost everywhere
to this one is also a possible version of the conditional expectation.
Conversely, according to the Radon-Nikodym theorem, any two versions
of the conditional expectation are equal almost everywhere. Thus we
have defined $E\{y \mid \mathcal{F}\}$ as any one of a class of random variables. Any
two random variables in the class are equal almost everywhere, and any
random variable equal almost everywhere to a member of the class is
itself in the class. In any expression involving a conditional expectation
it will always be understood, unless the contrary is stated explicitly, that
any one of the versions of the indicated conditional expectations can be
used in the expression.
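
When $\mathcal{F}$ is generated by a finite partition into sets of positive probability, a version of $E\{y \mid \mathcal{F}\}$ is obtained by averaging $y$ over each set of the partition, and (7.6) can be checked on every set of $\mathcal{F}$. The sketch below is an added illustration of that special case only. The space `P`, the variable `y` and the partition `atoms` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite space with unequal masses, and a random variable y.
P = {1: Fraction(1, 10), 2: Fraction(2, 10), 3: Fraction(3, 10), 4: Fraction(4, 10)}
y = {1: 5, 2: -1, 3: 2, 4: 0}

# The field F is generated by this partition; each atom has positive mass.
atoms = [frozenset({1, 2}), frozenset({3}), frozenset({4})]

def cond_exp(g, atoms, P):
    """On each atom, the P-weighted average of g. The result is constant on
    atoms, hence measurable with respect to the field they generate."""
    out = {}
    for atom in atoms:
        mass = sum(P[w] for w in atom)
        avg = sum(g[w] * P[w] for w in atom) / mass
        for w in atom:
            out[w] = avg
    return out

def integral(g, Lam, P):
    return sum((g[w] * P[w] for w in Lam), Fraction(0))

e = cond_exp(y, atoms, P)

# Every set of F is a union of atoms; check (7.6) on each of them.
F_sets = [frozenset().union(*c) for r in range(len(atoms) + 1)
          for c in combinations(atoms, r)]
holds = all(integral(e, Lam, P) == integral(y, Lam, P) for Lam in F_sets)
```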

Let $M$ be a measurable $\omega$ set, and let $\mathcal{F}$ be a Borel field of measurable
$\omega$ sets. Let $y$ be the random variable defined by

$$y(\omega) = \begin{cases} 1, & \omega \in M \\ 0, & \omega \in \Omega - M. \end{cases}$$

Then the conditional probability of $M$ relative to $\mathcal{F}$, denoted by $P\{M \mid \mathcal{F}\}$,
is defined as $E\{y \mid \mathcal{F}\}$, that is, as any one of the versions of this conditional
expectation. The conditional probability is thus any $\omega$ function which
is either measurable with respect to $\mathcal{F}$, or equal almost everywhere to an
$\omega$ function which is, which is integrable, and which satisfies the equation

$$\int_{\Lambda} P\{M \mid \mathcal{F}\}\, dP = P\{\Lambda M\}, \qquad \Lambda \in \mathcal{F}. \tag{7.7}$$

The preceding definitions are somewhat simplified if $\mathcal{F}$ includes all sets
of measure 0, because in that case $\mathcal{F} = \mathcal{F}'$ and conditional probabilities
and expectations relative to $\mathcal{F}$ are necessarily measurable relative to $\mathcal{F}$.
However, we have seen that in every case there is a version of $E\{y \mid \mathcal{F}\}$
which is measurable with respect to $\mathcal{F}$.




Now let $\{x_t, t \in T\}$ be any family of random variables. Let
$\mathcal{F} = \mathcal{B}(x_t, t \in T)$ be the smallest Borel field of $\omega$ sets with respect to
which the $x_t$'s are measurable (that is, $\mathcal{F}$ is the Borel field generated by
the class of sets of the form $\{x_t(\omega) \in A\}$ where $A$ is a Borel set) and let
$\mathcal{F}'$ be the Borel field of those $\omega$ sets which are either $\mathcal{F}$ sets or which
differ from $\mathcal{F}$ sets by sets of probability 0. In this book we shall describe
an $\omega$ set in $\mathcal{F}'$ as a measurable set on the sample space of the $x_t$'s, and
we shall describe an $\omega$ function measurable with respect to $\mathcal{F}'$, that is,
one equal almost everywhere to a function measurable with respect to $\mathcal{F}$,
as a random variable on the sample space of the $x_t$'s. In particular, if $T$ consists
of the integers $1, \cdots, n$, the $\omega$ set $\Lambda$ is measurable on the sample space
of the $x_t$'s if and only if it differs by at most an $\omega$ set of measure 0 from
one of the form

$$\{[x_1(\omega), \cdots, x_n(\omega)] \in A\},$$

where $A$ is a Borel set ($n$-dimensional in the real case, $2n$-dimensional in
the complex case), and the $\omega$ function $x$ is a random variable on the
sample space of the $x_t$'s if and only if $x$ is equal almost everywhere to a
Baire function of $x_1, \cdots, x_n$. (See Supplement, Theorem 1.5.)

Let $y$ be any random variable whose expectation exists, and let $M$ be
a measurable $\omega$ set. The conditional expectation [probability] of $y$ [$M$]
relative to the $x_t$'s, denoted by

$$E\{y \mid x_t, t \in T\} \qquad [P\{M \mid x_t, t \in T\}],$$

is defined as

$$E\{y \mid \mathcal{F}\} \qquad [P\{M \mid \mathcal{F}\}],$$

that is, as any version of the latter conditional expectation [probability].
Here $\mathcal{F}$ is defined as in the preceding paragraph. As always, we can
replace $\mathcal{F}$ by $\mathcal{F}'$ in this definition. Thus the conditional expectation in
question is defined as any random variable which is measurable with
respect to the sample space of the $x_t$'s and which has the same integral as
$y$ over every set measurable on the sample space of the $x_t$'s. When $\mathcal{F}$
is defined in this way, (7.6) and (7.7) can be put in a slightly more
convenient form by restricting the class of $\omega$ sets on which the equations are
to hold. The left and right sides of these equations define completely
additive functions of $\Lambda$, and such functions are completely determined
by their values on any subfield $\mathcal{F}_0$ of $\mathcal{F}$ which generates $\mathcal{F}$. (See
Supplement, Theorem 2.1.) In the present case $\mathcal{F}_0$ can be taken as the
class of $\omega$ sets which are finite unions of sets of the form

$$\{x_{t_j}(\omega) \in X_j,\ j = 1, \cdots, n\}$$

where $(t_1, \cdots, t_n)$ is any finite subset of $T$ and the $X_j$'s are Borel sets.
Thus it is sufficient if (7.6), or (7.7) as the case may be, is satisfied for $\Lambda$
of this type. Since the sides of (7.6) and (7.7) are additive in the
integration set, it is sufficient to verify them for the above individual
summands. If convenient we can even suppose that the $X_j$'s are right
semi-closed intervals (or open intervals, or closed intervals).

In particular, suppose that $T$ in the preceding paragraph consists of the
integers $1, \cdots, k$. Then we have defined

$$E\{y \mid x_1, \cdots, x_k\}, \qquad P\{M \mid x_1, \cdots, x_k\}$$

in a way consistent with the earlier discussions of Cases 1 and 2. In fact
(7.5), derived in Case 1, became the defining property (7.6) in the general
case. Consider a version of the conditional expectation which is
measurable with respect to $\mathcal{F} = \mathcal{B}(x_1, \cdots, x_k)$. Then we have seen that this
version can be written in the form

$$E\{y \mid x_1, \cdots, x_k\} = \Phi(x_1, \cdots, x_k),$$

where $\Phi$ is a Baire function of $k$ variables (Supplement, Theorem 1.5).
If such a version is used, we sometimes write

$$E\{y \mid x_j(\omega) = \xi_j,\ j = 1, \cdots, k\}$$

for

$$E\{y \mid x_1, \cdots, x_k\}\big|_{x_j(\omega) = \xi_j,\ j = 1, \cdots, k} = \Phi(\xi_1, \cdots, \xi_k).$$

In particular, if we use a version of $P\{M \mid x_1, \cdots, x_k\}$ which is a Baire
function of $x_1, \cdots, x_k$, we shall sometimes write

$$P\{M \mid x_j(\omega) = \xi_j,\ j = 1, \cdots, k\}$$

for

$$P\{M \mid x_1, \cdots, x_k\}\big|_{x_j(\omega) = \xi_j,\ j = 1, \cdots, k}.$$

The discussion we have given justifies the common description of the
random variables

$$E\{y \mid x_t, t \in T\}, \qquad P\{M \mid x_t, t \in T\}$$

as the conditional expectation of $y$ and probability of $M$ respectively for
given values of the $x_t$'s, or for given $x_t(\omega)$, $t \in T$.


8. Conditional probabilities and expectations: general properties 


Let $y$ be a random variable whose expectation exists, and let $\mathcal{F}$, $\mathcal{G}$ be
Borel fields of measurable $\omega$ sets. Let $\mathcal{F}'$ [$\mathcal{G}'$] be the Borel field of those
$\omega$ sets which are either $\mathcal{F}$ [$\mathcal{G}$] sets or which differ from such sets by sets
of probability 0. Suppose that $\mathcal{G}' \subset \mathcal{F}'$. Then $E\{y \mid \mathcal{F}\}$ and $E\{y \mid \mathcal{G}\}$
are not necessarily equal with probability 1. The second conditional
expectation is a coarser averaging than the first. More precisely, the two
conditional expectations have the same integrals as $y$ over $\mathcal{G}$ sets, but
the first one is not necessarily measurable with respect to $\mathcal{G}'$. However,
if the first one happens to be measurable with respect to $\mathcal{G}'$, the two
conditional expectations are equal with probability 1, according to the
following theorem.

THEOREM 8.1 Suppose that $\mathcal{G}' \subset \mathcal{F}'$ and that some (and therefore every)
version of $E\{y \mid \mathcal{F}\}$ is measurable with respect to $\mathcal{G}'$. Then

$$E\{y \mid \mathcal{F}\} = E\{y \mid \mathcal{G}\} \tag{8.1}$$

with probability 1.

To prove (8.1) we need only remark that $E\{y \mid \mathcal{F}\}$ is measurable with
respect to $\mathcal{G}'$ by hypothesis, and has the same integral over $\mathcal{G}$ sets as $y$,
since it even has the same integral over $\mathcal{F}'$ sets as $y$. Thus $E\{y \mid \mathcal{F}\}$
satisfies the conditions determining $E\{y \mid \mathcal{G}\}$.

THEOREM 8.2 Let $\{x_t, t \in T\}$ be a family of random variables, and suppose
that $T$ is non-denumerable. Then, if $y$ is a random variable whose
expectation exists, there is a denumerable subaggregate $\{t_n, n \geq 1\}$ of $T$ (depending
on $y$) such that

$$E\{y \mid x_t, t \in T\} = E\{y \mid x_{t_1}, x_{t_2}, \cdots\} \tag{8.2}$$

with probability 1.

If $S$ is any subset of $T$, let $\mathcal{F}_S = \mathcal{B}(x_t, t \in S)$ be the smallest Borel field
with respect to which the $x_t$'s with $t \in S$ are measurable. The left side of
(8.2) is by definition a version of $E\{y \mid \mathcal{F}_T\}$. From now on we shall
denote by $E\{y \mid \mathcal{F}_T\}$ a particular version of this conditional expectation
which is measurable with respect to $\mathcal{F}_T$. By Theorem 1.6 of the
Supplement, since this conditional expectation is measurable with respect to $\mathcal{F}_T$,
there is a denumerable subset $S$ of $T$ such that $E\{y \mid \mathcal{F}_T\}$ is measurable
with respect to $\mathcal{F}_S \subset \mathcal{F}_T$. Then by Theorem 8.1

$$E\{y \mid \mathcal{F}_T\} = E\{y \mid \mathcal{F}_S\}$$

with probability 1, as was to be proved.

If $\mathcal{F}$ is a Borel field of measurable $\omega$ sets, if $z$ is an $\omega$ function
measurable with respect to $\mathcal{F}$, and if $E\{|z|\} < \infty$, then

$$E\{z \mid \mathcal{F}\} = z$$

with probability 1. In fact, $z$ has the defining properties of the indicated
conditional expectation. More generally, we prove the following
theorem.




THEOREM 8.3 If $y$ is a random variable, if $z$ is an $\omega$ function measurable
with respect to the Borel field $\mathcal{F}$ of measurable $\omega$ sets, and if

$$E\{|y|\} < \infty, \qquad E\{|zy|\} < \infty,$$

then

$$E\{zy \mid \mathcal{F}\} = zE\{y \mid \mathcal{F}\} \tag{8.3}$$

with probability 1, and

$$E\{[y - E\{y \mid \mathcal{F}\}]z\} = 0. \tag{8.3$'$}$$

Equation (8.3$'$) is a trivial consequence of (8.3). Note that its validity
for $z$ taking on the values 0 and 1 is the defining property of the
conditional expectation $E\{y \mid \mathcal{F}\}$. To prove (8.3), we have only to prove,
according to the definition of conditional expectation, that

$$\int_{\Lambda} zy\, dP = \int_{\Lambda} zE\{y \mid \mathcal{F}\}\, dP, \qquad \Lambda \in \mathcal{F}. \tag{8.4}$$

Now if $z(\omega)$ is 1 or 0 according as $\omega$ is or is not a point of $M \in \mathcal{F}$, this
equation becomes

$$\int_{\Lambda M} y\, dP = \int_{\Lambda M} E\{y \mid \mathcal{F}\}\, dP,$$

and this equation is true by the definition of the integrand on the right.
It follows at once that (8.4) is true if $z$ is a linear combination of functions
of the type just considered, that is, if $z$ takes on only a finite number of
values, each on an $\mathcal{F}$ set, and the general case follows using the usual
approximation procedure.

The most useful particular case of (8.3) is

$$E\{\Phi(x)y \mid x\} = \Phi(x)E\{y \mid x\} \tag{8.3$''$}$$

with probability 1, where $x$ and $y$ are random variables, $\Phi$ is a Baire
function, and

$$E\{|y|\} < \infty, \qquad E\{|\Phi(x)y|\} < \infty.$$
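
Theorem 8.3 can be checked by hand when $\mathcal{F}$ is generated by a finite partition, since $z$ is then measurable with respect to $\mathcal{F}$ exactly when it is constant on each set of the partition. The sketch below is an added illustration under that assumption. The space, the partition `atoms`, and the variables `y` and `z` are hypothetical.

```python
from fractions import Fraction

P = {w: Fraction(1, 6) for w in range(6)}        # hypothetical uniform space
atoms = [{0, 1}, {2, 3, 4}, {5}]                 # partition generating F
y = {0: 4, 1: -2, 2: 1, 3: 7, 4: 1, 5: 3}
z = {0: 2, 1: 2, 2: -1, 3: -1, 4: -1, 5: 5}      # constant on each atom

def cond_exp(g):
    """Version of E{g | F}: the P-weighted average of g on each atom."""
    out = {}
    for atom in atoms:
        mass = sum(P[w] for w in atom)
        avg = sum(g[w] * P[w] for w in atom) / mass
        out.update({w: avg for w in atom})
    return out

e_y = cond_exp(y)
zy = {w: z[w] * y[w] for w in P}

lhs = cond_exp(zy)                               # E{zy | F}
rhs = {w: z[w] * e_y[w] for w in P}              # z E{y | F}

# (8.3'): the expectation of [y - E{y | F}] z vanishes.
residual = sum((y[w] - e_y[w]) * z[w] * P[w] for w in P)
```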


In §7 the conditional expectation of $y$ was defined as the integral of $y$
with respect to a conditional probability measure in two simple cases
(Cases 1 and 2). Although, as we shall see, this definition is not always
possible, because $P\{M \mid \mathcal{F}\}$ cannot always be considered a probability
measure in $M$ for fixed $\omega$, nevertheless $E\{y \mid \mathcal{F}\}$ considered as a functional
of $y$ has many of the properties of an integral. The following facts
illustrate this assertion. (The $y$'s are random variables and $\mathcal{F}$ is a Borel
field of $\omega$ sets throughout.)


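CE$_1$ $E\{1 \mid \mathcal{F}\} = 1$ with probability 1.

CE$_2$ If $y \geq 0$, then $E\{y \mid \mathcal{F}\} \geq 0$ with probability 1.

CE$_3$ If $c_1, \cdots, c_n$ are constants,

$$E\Big\{\sum_{j=1}^{n} c_j y_j \,\Big|\, \mathcal{F}\Big\} = \sum_{j=1}^{n} c_j E\{y_j \mid \mathcal{F}\}$$

with probability 1.

CE$_4$ $|E\{y \mid \mathcal{F}\}| \leq E\{|y| \mid \mathcal{F}\}$ with probability 1.

CE$_5$ If $\lim_{n \to \infty} y_n = y$ with probability 1, and if there is a random
variable $x \geq 0$, with $E\{x\} < \infty$, such that

$$|y_n(\omega)| \leq x(\omega)$$

with probability 1, then

$$\lim_{n \to \infty} E\{y_n \mid \mathcal{F}\} = E\{y \mid \mathcal{F}\}$$

with probability 1.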

In §9 it will be shown how to derive these results from the corresponding
integration theorems, using representation theory. It may be instructive
to derive them directly here, however. Properties CE$_1$, CE$_2$, CE$_3$ are
immediate consequences of the conditional expectation definition. To
prove CE$_4$ we suppose first that $y$ is real. Then, using CE$_2$,

$$E\{|y| - y \mid \mathcal{F}\} \geq 0, \qquad E\{|y| + y \mid \mathcal{F}\} \geq 0,$$

with probability 1. Hence, using CE$_3$,

$$E\{y \mid \mathcal{F}\} \leq E\{|y| \mid \mathcal{F}\}, \qquad -E\{y \mid \mathcal{F}\} = E\{-y \mid \mathcal{F}\} \leq E\{|y| \mid \mathcal{F}\}$$

with probability 1, so that CE$_4$ is true. If $y$ is complex-valued, choose
the random variable $z$ to satisfy

$$z(\omega) = 0 \quad \text{if } E\{y \mid \mathcal{F}\} = 0,$$

$$z(\omega)E\{y \mid \mathcal{F}\} = |E\{y \mid \mathcal{F}\}| \quad \text{if } E\{y \mid \mathcal{F}\} \neq 0.$$

(Here $E\{y \mid \mathcal{F}\}$ is a particular version of this conditional expectation, held
fast in the following.) Let $z_1$ be the real part of $zy$. Then, using the fact
that $z$ is measurable with respect to $\mathcal{F}$,

$$|E\{y \mid \mathcal{F}\}| = zE\{y \mid \mathcal{F}\} = E\{zy \mid \mathcal{F}\} = E\{z_1 \mid \mathcal{F}\}$$

with probability 1. Since $z_1$ is real, we can apply the real case of CE$_4$,
already proved, to continue this inequality, obtaining, since

$$|z_1(\omega)| \leq |y(\omega)|,$$

the desired inequality

$$|E\{y \mid \mathcal{F}\}| \leq E\{|z_1| \mid \mathcal{F}\} \leq E\{|y| \mid \mathcal{F}\}$$
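(with probability 1). To prove CE$_5$, define $\hat{y}_n$ by

$$\hat{y}_n(\omega) = \underset{j \geq n}{\text{L.U.B.}}\ |y_j(\omega) - y(\omega)|.$$

Then

$$\hat{y}_1(\omega) \geq \hat{y}_2(\omega) \geq \cdots \geq 0, \qquad \hat{y}_n(\omega) \leq 2x(\omega),$$

with probability 1, and

$$\lim_{n \to \infty} \hat{y}_n(\omega) = 0$$

with probability 1. Because of CE$_2$, CE$_3$, and CE$_4$,

$$|E\{y \mid \mathcal{F}\} - E\{y_n \mid \mathcal{F}\}| = |E\{y - y_n \mid \mathcal{F}\}| \leq E\{|y - y_n| \mid \mathcal{F}\} \leq E\{\hat{y}_n \mid \mathcal{F}\}$$

with probability 1. Hence it is sufficient to prove that

$$\lim_{n \to \infty} E\{\hat{y}_n \mid \mathcal{F}\} = 0$$

with probability 1. We have, from CE$_2$ and CE$_3$,

$$E\{\hat{y}_1 \mid \mathcal{F}\} \geq E\{\hat{y}_2 \mid \mathcal{F}\} \geq \cdots \geq 0$$

with probability 1, so that there must be convergence,

$$\lim_{n \to \infty} E\{\hat{y}_n \mid \mathcal{F}\} = w$$

with probability 1. Moreover, by definition of conditional expectation
[(7.6) with $\Lambda = \Omega$],

$$E\{w\} \leq E\{E\{\hat{y}_n \mid \mathcal{F}\}\} = E\{\hat{y}_n\}$$

and the right side goes to 0 when $n \to \infty$ because it is the integral of
$\hat{y}_n$, where $\hat{y}_n$ is dominated by $2x$, and goes to 0 with probability 1 when
$n \to \infty$. Thus $E\{w\} = 0$, so that $w = 0$ with probability 1, as was to
be proved.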

The listed properties of conditional expectations imply the following
properties of conditional probabilities. (The $M$'s are measurable $\omega$ sets
and $\mathcal{F}$ is a Borel field of measurable $\omega$ sets.)

CP$_1$ $0 \leq P\{M \mid \mathcal{F}\} \leq 1$ with probability 1.

CP$_2$ If $P\{M\} = 0$, then $P\{M \mid \mathcal{F}\} = 0$ with probability 1;
if $P\{M\} = 1$, then $P\{M \mid \mathcal{F}\} = 1$ with probability 1.

CP$_3$ If either

$$M_1 \supset M_2 \supset \cdots, \qquad \bigcap_{n} M_n = M$$

or

$$M_1 \subset M_2 \subset \cdots, \qquad \bigcup_{n} M_n = M,$$

then

$$\lim_{n \to \infty} P\{M_n \mid \mathcal{F}\} = P\{M \mid \mathcal{F}\}$$

with probability 1.

CP$_4$ If $M_1, M_2, \cdots$ are disjunct, and finite or denumerably infinite in
number,

$$P\Big\{\bigcup_{n} M_n \,\Big|\, \mathcal{F}\Big\} = \sum_{n} P\{M_n \mid \mathcal{F}\}$$

with probability 1.

With the help of the listed properties of conditional probabilities and 
expectations, we can derive a suggestive evaluation of a conditional 
expectation. 

THEOREM 8.4 Let $y$ be a random variable whose expectation exists, let
$\mathcal{F}$ be a Borel field of measurable $\omega$ sets, and let $\delta$ be a positive number.
Then the sum

$$x_\delta = \sum_{j=-\infty}^{\infty} j\delta\, P\{j\delta \leq y(\omega) < (j+1)\delta \mid \mathcal{F}\}$$

is absolutely convergent, with probability 1, and if $\lim_{n \to \infty} \delta_n = 0$ it follows
that

$$\lim_{n \to \infty} x_{\delta_n} = E\{y \mid \mathcal{F}\}$$

with probability 1.

Note that this theorem does not state that conditional probabilities
relative to $\mathcal{F}$ determine a conditional probability measure for fixed $\omega$,
and that $E\{y \mid \mathcal{F}\}$ is the integral of $y$ with respect to the corresponding
probability measure. The possibility of such an interpretation is discussed
in §9.




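To prove the theorem define $y_\delta$ and $y_{n,\delta}$ by
$$y_\delta(\omega) = j\delta \quad \text{where } j\delta \le y(\omega) < (j+1)\delta, \qquad j = 0, \pm 1, \cdots,$$
$$y_{n,\delta}(\omega) = \begin{cases} y_\delta(\omega) & \text{where } -n\delta \le y(\omega) < (n+1)\delta, \\ 0 & \text{otherwise.} \end{cases}$$
Then
$$\lim_{n \to \infty} y_{n,\delta} = y_\delta, \qquad \lim_{\delta \to 0} y_\delta = y,$$
$$|y_{n,\delta}(\omega)| \le |y_\delta(\omega)| \le |y(\omega)| + \delta.$$
According to CE$_5$,
$$\lim_{n \to \infty} E\{y_{n,\delta} \mid \mathscr{F}\} = E\{y_\delta \mid \mathscr{F}\}$$
with probability 1. Since
$$E\{y_{n,\delta} \mid \mathscr{F}\} = \sum_{j=-n}^{n} j\delta\, P\{j\delta \le y(\omega) < (j+1)\delta \mid \mathscr{F}\}$$
with probability 1, we have proved that
$$\lim_{n \to \infty} \sum_{j=-n}^{n} j\delta\, P\{j\delta \le y(\omega) < (j+1)\delta \mid \mathscr{F}\} = E\{y_\delta \mid \mathscr{F}\}$$
with probability 1. Applying this result to $|y|$, we deduce that the infinite
series whose $n$th partial sum appears on the left in the preceding equation
converges absolutely, with probability 1, and that therefore
$$\sum_{j=-\infty}^{\infty} j\delta\, P\{j\delta \le y(\omega) < (j+1)\delta \mid \mathscr{F}\} = E\{y_\delta \mid \mathscr{F}\},$$
with probability 1. The last assertion of Theorem 8.4 is now an immediate
consequence of CE$_5$ applied to the sequence $\{y_{\delta_n}\}$.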
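As an informal illustration of Theorem 8.4, the following Python sketch
evaluates the sum of the theorem on a finite sample space, where the Borel
field is generated by a partition and conditional probabilities reduce to
ratios. The sample space, the partition `cells`, the values of `y`, and all
function names are hypothetical choices made for this sketch; exact rational
arithmetic is used so that the bracketing by $\delta$ can be checked exactly.

```python
from fractions import Fraction as Fr
from math import floor

# Hypothetical finite sample space: six equally likely points.
P = {w: Fr(1, 6) for w in range(6)}
# Hypothetical partition; the Borel field consists of unions of these cells.
cells = [{0, 1, 2}, {3, 4, 5}]
# Hypothetical real random variable y.
y = {0: Fr(3, 10), 1: Fr(-6, 5), 2: Fr(5, 2),
     3: Fr(9, 10), 4: Fr(9, 10), 5: Fr(-2, 5)}

def prob(S):
    """Probability of a set of sample points."""
    return sum((P[w] for w in S), Fr(0))

def cond_prob(M, cell):
    """Value of P{M | F} on a partition cell: P{M cell} / P{cell}."""
    return prob(M & cell) / prob(cell)

def cond_exp(cell):
    """Value of E{y | F} on a partition cell."""
    return sum((P[w] * y[w] for w in cell), Fr(0)) / prob(cell)

def x_delta(delta, cell):
    """Sum over j of j*delta * P{j*delta <= y < (j+1)*delta | F} on a cell."""
    total = Fr(0)
    for j in {floor(y[w] / delta) for w in P}:
        M = {w for w in P if floor(y[w] / delta) == j}
        total += j * delta * cond_prob(M, cell)
    return total

for delta in (Fr(1), Fr(1, 10), Fr(1, 100)):
    for cell in cells:
        gap = cond_exp(cell) - x_delta(delta, cell)
        print(delta, float(gap))  # each gap lies in [0, delta)
```

Because $y_\delta \le y < y_\delta + \delta$, the gap between the conditional
expectation and the discretized sum is squeezed to zero as $\delta$ decreases,
which is the content of the last assertion of the theorem in this special case.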


9. Conditional probability distributions 

It would be very convenient in the theory of probability if corresponding 
to each Borel field F of measurable w sets there were a function defined 
for every measurable w set M and point w, with value P(M, w) at M, w, 
such that 


CD$_1$  For each $\omega$, $P(M, \omega)$ defines a probability measure of $M$, and,
for each $M$, $P(M, \omega)$ defines an $\omega$ function equal almost everywhere to
one measurable with respect to $\mathscr{F}$.

CD$_2$  For each $M$,
$$P(M, \omega) = P\{M \mid \mathscr{F}\},$$
with probability 1.

These two properties imply CP$_1$-CP$_4$ of §8, but are stronger.


If there is such an $M$, $\omega$ function, it is not necessarily uniquely
determined, but any two such functions are equal with probability 1 for fixed
$M$. When there is a function satisfying CD$_1$ and CD$_2$, it means that the
conditional probabilities (conditioned by $\mathscr{F}$) can be defined in such a
way that they determine a new probability measure for each $\omega$, and this
probability measure, a function of the parameter $\omega$, is called the
conditional probability distribution relative to $\mathscr{F}$. Unfortunately
there may be no such conditional probability distribution.

As an example of a case where there is such a distribution, suppose that
$A_1, A_2, \cdots$ are disjunct measurable $\omega$ sets with
$$P\{A_j\} > 0, \qquad \bigcup_j A_j = \Omega.$$
Let $\mathscr{F}$ be the class of sets which are unions of $A_j$'s. Then one
possible version of $P\{M \mid \mathscr{F}\}$ is given by
$$P(M, \omega) = \frac{P\{M A_j\}}{P\{A_j\}}, \qquad \omega \in A_j, \quad j \ge 1.$$
So defined, $P(M, \omega)$ satisfies CD$_1$ and CD$_2$.

THEOREM 9.1 If there is a conditional distribution relative to $\mathscr{F}$,
then if $y$ is any random variable whose expectation exists, one version of
$E\{y \mid \mathscr{F}\}$ is given by the integral of $y$ with respect to this
conditional distribution (as a function of $\omega$); making the obvious
notational conventions,
$$E\{y \mid \mathscr{F}\} = \int_{\Omega} y(\omega')\, P(d\omega', \omega).$$


Let $\mathscr{H}$ be the class of random variables $y$ for which this assertion
is true. Then $\mathscr{H}$ includes each random variable which is 1 on a
measurable $\omega$ set $M$ and vanishes otherwise, according to CD$_2$. Evidently
$\mathscr{H}$ is a linear class, that is, it includes all linear combinations of
its elements. Then $\mathscr{H}$ includes the random variables which only take on
finitely many values. Finally, with the help of CE$_5$ of §8, we deduce that
$\mathscr{H}$ includes each random variable that is the limit of a sequence of
random variables $y_1, y_2, \cdots$ in $\mathscr{H}$, with the property that there
is a random variable $x$ such that
$$|y_n(\omega)| \le x(\omega), \qquad E\{x\} < \infty.$$
Then $\mathscr{H}$ includes every random variable $y$ whose expectation exists, as
was to be proved.
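The partition example and Theorem 9.1 can be checked numerically in a small
case. The Python sketch below is only an illustration: the sample space, its
weights, the partition `cells` (playing the role of the sets $A_j$), and the
random variable `y` are hypothetical. It builds $P(M, \omega)$ as a ratio of
probabilities and evaluates the conditional expectation as an integral (here a
finite sum) with respect to that measure.

```python
from fractions import Fraction as Fr

# Hypothetical sample space with unequal weights.
P = {0: Fr(1, 10), 1: Fr(2, 10), 2: Fr(3, 10), 3: Fr(1, 10), 4: Fr(3, 10)}
cells = [{0, 1}, {2, 3, 4}]            # the disjunct sets A_1, A_2
y = {0: 4, 1: -2, 2: 1, 3: 7, 4: 0}    # a hypothetical random variable

def prob(S):
    """Probability of a set of sample points."""
    return sum((P[w] for w in S), Fr(0))

def cell_of(w):
    """The set A_j containing the point w."""
    return next(A for A in cells if w in A)

def P_cond(M, w):
    """P(M, w) = P{M A_j} / P{A_j} for w in A_j."""
    A = cell_of(w)
    return prob(M & A) / prob(A)

def E_cond(w):
    """Integral of y with respect to the measure M -> P(M, w)."""
    return sum((y[v] * P_cond({v}, w) for v in P), Fr(0))

# Defining property of a conditional expectation, checked on one field set.
Lam = cells[1]
print(sum(E_cond(w) * P[w] for w in Lam) == sum(y[w] * P[w] for w in Lam))
```

For each fixed sample point the function `P_cond` is a probability measure in
its first argument, and integrating it over any union of cells recovers the
unconditional probability, which is what CD$_1$ and CD$_2$ require in this
finite setting.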

Let $y_1, \cdots, y_n$ be random variables, and let
$\mathscr{F}_n = \mathscr{B}(y_1, \cdots, y_n)$ be the smallest Borel field of
$\omega$ sets with respect to which $y_1, \cdots, y_n$ are measurable. Suppose
that there is an $M$, $\omega$ function defined for $M \in \mathscr{F}_n$, and all
$\omega$, and satisfying CD$_1$ and CD$_2$ for $M \in \mathscr{F}_n$. If $Y$ is an
$n$-dimensional Borel set ($2n$-dimensional if the $y_j$'s are complex), define
$$p(Y, \omega) = P(M, \omega), \qquad M = \{[y_1(\omega), \cdots, y_n(\omega)] \in Y\}.$$
Then $p(Y, \omega)$ defines a probability measure of Borel sets, depending on
the parameter $\omega$. The probability measure of $\mathscr{F}_n$ sets determined
by $P(M, \omega)$ is called the conditional probability distribution of the
$y_j$'s relative to $\mathscr{F}$.

THEOREM 9.2 Suppose that there is a conditional distribution of
$y_1, \cdots, y_n$ relative to $\mathscr{F}$, and let $\Phi$ be a Baire function of
$n$ variables with $E\{|\Phi(y_1, \cdots, y_n)|\} < \infty$. Then one version of
$E\{\Phi(y_1, \cdots, y_n) \mid \mathscr{F}\}$ is given by the $\omega$ integral of
$\Phi(y_1, \cdots, y_n)$ with respect to the conditional probability
distribution of the $y_j$'s relative to $\mathscr{F}$.

To prove this theorem let $\mathscr{H}$ be the class of Baire functions $\Phi$ for
which the assertion of the theorem is true. Then, by definition of the
conditional distribution, $\mathscr{H}$ includes each function which is 1 on a
Borel set in $n$ dimensions ($2n$ dimensions if the $y_j$'s are complex) and 0
otherwise, and the proof is then carried through like that of the preceding
theorem. The particular case $n = 1$, $\Phi(x) = x$, is also trivially deducible
from Theorem 8.4.

THEOREM 9.3 If $P_1(M, \omega)$, $P_2(M, \omega)$ define conditional probability
distributions of $y_1, \cdots, y_n$ relative to $\mathscr{F}$, and if
$p_j(Y, \omega)$ is defined as above by
$$p_j(Y, \omega) = P_j(M, \omega),$$
then there is an $\omega$ set $\Lambda_0$ (which does not depend on $Y$), of
probability 0, such that
$$p_1(Y, \omega) = p_2(Y, \omega), \qquad \omega \notin \Lambda_0.$$

If $Y$ is a Borel set,
$$P\{[y_1(\omega), \cdots, y_n(\omega)] \in Y \mid \mathscr{F}\} = p_1(Y, \omega) = p_2(Y, \omega)$$
with probability 1. There is therefore an $\omega$ set $\Lambda(Y)$, of
probability 0, such that
$$p_1(Y, \omega) = p_2(Y, \omega), \qquad \omega \notin \Lambda(Y).$$
Now let $Y_1, Y_2, \cdots$ be an enumeration of the intervals with rational
vertices ($n$-dimensional intervals if the $y_j$'s are real, $2n$-dimensional in
the complex case). Define $\Lambda_0 = \bigcup_{j=1}^{\infty} \Lambda(Y_j)$. Then
$$p_1(Y, \omega) = p_2(Y, \omega), \qquad \omega \notin \Lambda_0$$
if $Y$ is a $Y_j$, and therefore if $Y$ is any interval. The theorem then follows
from the fact that two measures of Borel sets are identical if they are
equal when the sets are intervals.




We have not yet discussed conditions which insure the existence of a
conditional probability distribution of the random variables
$y_1, \cdots, y_n$ relative to a Borel field. We shall first obtain a
preliminary result. Let $\mathscr{F}_n$ be the smallest Borel field of $\omega$
sets with respect to which the $y_j$'s are measurable. In the following, $Y$
will denote a Borel set in $n$ dimensions (or in $2n$ dimensions if the $y_j$'s
are complex-valued). We shall call a $Y$, $\omega$ function $p$ a conditional
probability distribution of the $y_j$'s in the wide sense, relative to
$\mathscr{F}$, if $p$ defines an $\omega$ function equal almost everywhere to one
measurable with respect to $\mathscr{F}$ when $Y$ is fixed, and a probability
measure of $Y$ when $\omega$ is fixed, and if, for each $Y$,

(9.1)   $$P\{[y_1(\omega), \cdots, y_n(\omega)] \in Y \mid \mathscr{F}\} = p(Y, \omega)$$

with probability 1. Since the $\omega$ set involved in the conditional
probability on the left does not uniquely determine $Y$, the function $p$ does
not define for each $\omega$ a probability measure of $\mathscr{F}_n$ sets. That
is, the existence of $p$ does not guarantee the existence of a conditional
distribution of $y_1, \cdots, y_n$ relative to $\mathscr{F}$. However, if $p$
exists, and if $\Phi$ is any Baire function of $n$ variables, with
$$E\{|\Phi(y_1, \cdots, y_n)|\} < \infty,$$
then

(9.2)   $$E\{\Phi(y_1, \cdots, y_n) \mid \mathscr{F}\} = \int \cdots \int \Phi(\eta_1, \cdots, \eta_n)\, p(d\eta, \omega),$$

with probability 1, where the integral on the right is, for each $\omega$, an
ordinary integral in $n$ dimensions. The proof is exactly the same as that
of Theorem 9.2. The present result differs from that of Theorem 9.2 in
that the earlier result involves integration in $\omega$ space. We shall see
that the existence of a conditional distribution of $y_1, \cdots, y_n$ in the
wide sense is almost as useful as that of a conditional distribution. Like
the latter, the former is not uniquely determined, but if $p_1$ and $p_2$ are
both conditional distributions in the wide sense there is an $\omega$ set of
probability 0 such that if $\omega$ is not in this set
$$p_1(Y, \omega) = p_2(Y, \omega)$$
for all $Y$. The proof follows that of Theorem 9.3.

THEOREM 9.4 Let $y_1, \cdots, y_n$ be any random variables, and let
$\mathscr{F}$ be any Borel field of measurable $\omega$ sets. Then there is a
conditional distribution $p$ of $y_1, \cdots, y_n$ in the wide sense, relative
to $\mathscr{F}$, such that $p(Y, \omega)$, for fixed $Y$, defines a function
measurable relative to $\mathscr{F}$.

Note that, according to this theorem, if $x_1, \cdots, x_m$ are any random
variables, the conditional distribution relative to $x_1, \cdots, x_m$ can be
taken as a function of $Y$ and the $x_j$'s which determines a Baire function
of the latter variables when $Y$ is fixed. This follows from the theorem,
if we take $\mathscr{F}$ to be the smallest Borel field of $\omega$ sets with
respect to which the $x_j$'s are measurable, so that the functions measurable
with respect to $\mathscr{F}$ are the Baire functions of the $x_j$'s.

For definiteness, we give the proof for real $y_j$'s. It will be obvious that
the theorem can be formulated, and proved similarly, for enumerably
many $y_j$'s. In the following, $F$ will be any distribution function in $n$
dimensions; it will remain fixed throughout the discussion. We must
define the function value $p(Y, \omega)$ for $Y$ an $n$-dimensional Borel set and
$\omega \in \Omega$. The definition is given first for $Y$ an interval of the form
$$I_{\lambda_1 \cdots \lambda_n} = \{-\infty < \xi_j \le \lambda_j, \; j = 1, \cdots, n\}.$$
For each $n$-tuple of rational numbers $(\lambda_1, \cdots, \lambda_n)$ choose a
version of the conditional probability
$P\{y_j(\omega) \le \lambda_j, \; j = 1, \cdots, n \mid \mathscr{F}\}$ which is
measurable with respect to $\mathscr{F}$. It is easily seen, using CP$_1$-CP$_4$ of
§8, that there is an $\omega$ set $\Lambda \in \mathscr{F}$ of probability 0 such
that, if $\omega \notin \Lambda$, the function of (rational)
$\lambda_1, \cdots, \lambda_n$ determined by this conditional probability
coincides on the set of rational points in $n$ dimensions with some
distribution function in $n$ dimensions. That is, for $\omega \notin \Lambda$ this
function of rational $\lambda_1, \cdots, \lambda_n$ is monotone non-decreasing
and continuous on the right in each variable and so on (see §3). Define
$$p(I_{\lambda_1 \cdots \lambda_n}, \omega) = \begin{cases} P\{y_j(\omega) \le \lambda_j, \; j = 1, \cdots, n \mid \mathscr{F}\}, & \omega \notin \Lambda, \\ F(\lambda_1, \cdots, \lambda_n), & \omega \in \Lambda, \end{cases}$$
for rational $\lambda_j$'s, using the above versions of the conditional
probabilities. If the $\lambda_j$'s are not all rational, define
$$p(I_{\lambda_1 \cdots \lambda_n}, \omega) = \lim_{\mu_j \downarrow \lambda_j, \; j = 1, \cdots, n} p(I_{\mu_1 \cdots \mu_n}, \omega)$$
(where the $\mu_j$'s are rational). Then $p(I_{\lambda_1 \cdots \lambda_n}, \omega)$
for fixed $\omega$ determines a distribution function in
$\lambda_1, \cdots, \lambda_n$, and thereby determines a probability
measure of $n$-dimensional Borel sets $Y$. Let $p(Y, \omega)$ be the measure
assigned to $Y$ in this way. We conclude the proof of the theorem by
proving that, for each $Y$, $p(Y, \omega)$ defines an $\omega$ function measurable
with respect to $\mathscr{F}$, and that (9.1) is then true, with probability 1.
[The exceptional set in (9.1) will depend on $Y$ and on the choice of the
conditional probability.] The assertion is true, by definition of $p$, if $Y$
is an interval $I_{\lambda_1 \cdots \lambda_n}$. Hence, applying CP$_4$ of §8, the
assertion is true if $Y$ is a right semi-closed interval, finite or infinite,
and therefore if $Y$ is a finite union of such intervals. Finally, the class of
sets $Y$ for which the assertion is true includes, applying CP$_3$ of §8, the
limits of monotone sequences of sets in the class. Hence the class includes
all Borel sets $Y$, as was to be proved.




We now prove that, under rather wide conditions, there is actually a
conditional probability distribution of $y_1, \cdots, y_n$ relative to
$\mathscr{F}$. The condition is one on the range $R$ of $y_1, \cdots, y_n$, the
$n$-dimensional set ($2n$-dimensional in the complex case) of points
$[y_1(\omega), \cdots, y_n(\omega)]$ as $\omega$ ranges through its whole space.

THEOREM 9.5 Let $y_1, \cdots, y_n$ be random variables, and let $\mathscr{F}$ be a
Borel field of measurable $\omega$ sets. Then, if the range of
$y_1, \cdots, y_n$ is a Borel set, there is a conditional distribution of
$y_1, \cdots, y_n$ relative to $\mathscr{F}$, such that $P(M, \omega)$, for fixed
$M$, defines a function measurable relative to $\mathscr{F}$.

We give the proof for real $y_j$'s. In the following, $\mathscr{F}_n$ is as above,
and $q$ is any probability measure of $\mathscr{F}_n$ sets, fixed throughout the
discussion. The fact that there is such a $q$ will appear in the course of the
proof. Let $p$ be a conditional distribution (wide sense) of
$y_1, \cdots, y_n$ relative to $\mathscr{F}$ as in Theorem 9.4. Then, applying
(9.1), we find that

(9.3)   $$1 = P\{\Omega \mid \mathscr{F}\} = P\{[y_1(\omega), \cdots, y_n(\omega)] \in R \mid \mathscr{F}\} = p(R, \omega),$$

with probability 1. If $M \in \mathscr{F}_n$ and if there are two Borel sets
$Y_1$, $Y_2$ such that
$$M = \{[y_1(\omega), \cdots, y_n(\omega)] \in Y_1\} = \{[y_1(\omega), \cdots, y_n(\omega)] \in Y_2\},$$
then $Y_1 - Y_1 Y_2$ and $Y_2 - Y_1 Y_2$ are subsets of the complement of $R$.
Hence
$$p(Y_1, \omega) = p(Y_2, \omega) = p(Y_1 Y_2, \omega), \qquad \text{if } p(R, \omega) = 1.$$
Thus the following definition is unique:

(9.4)   $$P(M, \omega) = p(Y, \omega), \qquad M = \{[y_1(\omega), \cdots, y_n(\omega)] \in Y\}, \qquad \text{if } p(R, \omega) = 1.$$

This definition makes $P(M, \omega)$ define a probability measure in $M$ for
fixed $\omega$, showing incidentally that such a probability measure in
$M \in \mathscr{F}_n$ exists. Finally set

(9.5)   $$P(M, \omega) = q(M) \qquad \text{if } p(R, \omega) < 1.$$

The function $P$ defined in this way is the desired conditional probability
distribution.


The condition of Theorem 9.5, that the range of $y_1, \cdots, y_n$ be a Borel
set, is a useful condition. It is always satisfied, for example, if
$y_1, \cdots, y_n$ are discrete random variables, that is, if $R$ is a finite or
enumerable set. (In this case, of course, the proof can be simplified to
the point of triviality.) At the other extreme, it is always satisfied if $R$
is the whole $n$-dimensional space ($2n$-dimensional space in the complex
case). This is the most important special case and is especially useful
because, when the representation theory of §6 is used, the random
variables under discussion all become coordinate variables in a
multidimensional coordinate space, so that $R$ is the whole space. In order to
make full use of Theorem 9.5 in this connection we now investigate the
transformation of conditional probabilities and expectations in going
from random variables to their representations. Since conditional
probabilities are special cases of conditional expectations, we shall discuss
only the latter. Suppose that $y$, $x_t$ are random variables, for $t \in T$, and
that these random variables are represented by coordinate variables
$\tilde{y}$, $\tilde{x}_t$ of a coordinate space. It is supposed that
$E\{|y|\} < \infty$. If $T$ consists of the integers $1, \cdots, n$ we thus
represent $y, x_1, \cdots, x_n$ by the coordinate variables
$\tilde{y}, \tilde{x}_1, \cdots, \tilde{x}_n$ of $(n+1)$-dimensional space, or
$(2n+2)$-dimensional space in the complex case. We have seen that the
conditional expectation $E\{y \mid x_1, \cdots, x_n\}$ can be taken as a certain
Baire function $\Phi$ of $x_1, \cdots, x_n$. The random variable $\tilde{y}$
defined on $\tilde{\Omega}$ space has the same distribution as $y$. Hence we have
$E\{|\tilde{y}|\} < \infty$, and we can consider the conditional expectation
$E\{\tilde{y} \mid \tilde{x}_1, \cdots, \tilde{x}_n\}$. Let
$$\mathscr{F} = \mathscr{B}(x_1, \cdots, x_n) \qquad [\tilde{\mathscr{F}} = \mathscr{B}(\tilde{x}_1, \cdots, \tilde{x}_n)]$$
be the smallest Borel field of $\omega$ [$\tilde{\omega}$] sets with respect to
which the $x_j$'s [$\tilde{x}_j$'s] are measurable. Now
$$\int_{\Lambda} y \, dP = \int_{\Lambda} \Phi(x_1, \cdots, x_n) \, dP, \qquad \Lambda \in \mathscr{F},$$
by definition of conditional expectation. Then
$$\int_{\tilde{\Lambda}} \tilde{y} \, d\tilde{P} = \int_{\tilde{\Lambda}} \Phi(\tilde{x}_1, \cdots, \tilde{x}_n) \, d\tilde{P}, \qquad \tilde{\Lambda} \in \tilde{\mathscr{F}},$$
in view of the properties of the $\omega$ to $\tilde{\omega}$ transformation listed
in §6, that is, $\Phi(\tilde{x}_1, \cdots, \tilde{x}_n)$ is one version of
$E\{\tilde{y} \mid \tilde{x}_1, \cdots, \tilde{x}_n\}$. More generally, for
arbitrary $T$, each $\omega$ random variable corresponds to an $\tilde{\omega}$
random variable in such a way that corresponding random variables have the
same integrals in their respective spaces, and that $x_t$ corresponds to
$\tilde{x}_t$. Then if $\mathscr{F}$ [$\tilde{\mathscr{F}}$] is the smallest Borel
field with respect to which the $x_t$'s [$\tilde{x}_t$'s] are measurable, the
$\mathscr{F}$ sets correspond to the $\tilde{\mathscr{F}}$ sets. The functions
$E\{y \mid x_t, t \in T\}$, $E\{\tilde{y} \mid \tilde{x}_t, t \in T\}$ are
determined by the equations
$$\int_{\Lambda} y \, dP = \int_{\Lambda} E\{y \mid x_t, t \in T\} \, dP, \qquad \Lambda \in \mathscr{F},$$
$$\int_{\tilde{\Lambda}} \tilde{y} \, d\tilde{P} = \int_{\tilde{\Lambda}} E\{\tilde{y} \mid \tilde{x}_t, t \in T\} \, d\tilde{P}, \qquad \tilde{\Lambda} \in \tilde{\mathscr{F}}.$$
Since the integral of an $\omega$ function on an $\omega$ set is equal to the
$\tilde{\omega}$ integral of the corresponding $\tilde{\omega}$ function on the
corresponding $\tilde{\omega}$ set, it follows that the two conditional
expectations correspond to each other in the $\omega$, $\tilde{\omega}$
correspondence.

If $y$ is a random variable whose expectation exists, and if $\mathscr{F}$ is an
arbitrary Borel field of $\omega$ sets, we have not yet explained how to apply
representation theory to the study of $E\{y \mid \mathscr{F}\}$ unless, as above,
$\mathscr{F}$ is the smallest Borel field of $\omega$ sets with respect to which
the random variables of some family are measurable. However, the general
$\mathscr{F}$ is easily expressed in this way; the general random variable of
the family can be taken to be the function which is 1 on an $\mathscr{F}$ set
and 0 otherwise.

We exhibit the application of the preceding discussion to the proof of
an important inequality. Let $y$ be a real random variable, and suppose
that $f$ is a continuous convex function of a single real variable, defined in
an interval. Then, according to Jensen's inequality,

(9.6)   $$f[E\{y\}] \le E\{f(y)\}$$

if the right side exists. Now let $\mathscr{F}$ be any Borel field of measurable
$\omega$ sets. Then, if conditional expectations behave like ordinary
expectations,

(9.7)   $$f[E\{y \mid \mathscr{F}\}] \le E\{f(y) \mid \mathscr{F}\}$$

with probability 1, if $y$ and $f(y)$ have expectations. Although this
inequality can be proved directly from the definition of conditional
expectations, it is instructive to use the methods we have just developed.
We suppose in the following that $f$ is defined in an interval $I$ containing
the range of $y$. It is then trivial to prove that $E\{y \mid \mathscr{F}\}$ will
also lie in $I$, with probability 1. The simplest proof of the desired
inequality uses the conditional distribution $p_0$ of $y$, in the wide sense,
relative to $\mathscr{F}$. In terms of $p_0$, (9.7) becomes
$$f\Big[\int_I \eta \, p_0(d\eta, \omega)\Big] \le \int_I f(\eta) \, p_0(d\eta, \omega),$$
for all $\omega$ with $p_0(I, \omega) = 1$, and this means for almost all $\omega$.
We have thus reduced (9.7) to Jensen's inequality as applied to the wide sense
conditional distribution. We also give a proof of (9.7), by means of
representation theory, to show how in discussions of this type one can
either use wide sense conditional distributions, as we have just done, or
true conditional distributions, after applying representation theory.
Apply representation theory to obtain representations of $y$ and
$\mathscr{F}$ on a coordinate space. Then the following are three pairs of
corresponding functions:
$$E\{y \mid \mathscr{F}\}, \qquad E\{\tilde{y} \mid \tilde{\mathscr{F}}\}$$
$$E\{f(y) \mid \mathscr{F}\}, \qquad E\{f(\tilde{y}) \mid \tilde{\mathscr{F}}\}$$
$$f[E\{y \mid \mathscr{F}\}], \qquad f[E\{\tilde{y} \mid \tilde{\mathscr{F}}\}].$$
Since inequalities are preserved in the correspondence, (9.7) is equivalent to

(9.8)   $$f[E\{\tilde{y} \mid \tilde{\mathscr{F}}\}] \le E\{f(\tilde{y}) \mid \tilde{\mathscr{F}}\}$$

(to hold with probability 1). However, since $\tilde{y}$ is a coordinate
variable with range the whole line, the conditional expectations in (9.8) can
be evaluated as integrals, with respect to conditional distributions, according
to Theorems 9.2 and 9.5. Thus (9.8) reduces to Jensen's inequality
applied to the conditional probability measures.
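The reduction of (9.7) to the ordinary Jensen inequality can be seen directly
when the Borel field is generated by a finite partition, since the conditional
distribution on each cell is then an ordinary probability measure. The Python
sketch below checks the inequality cell by cell; the sample space, weights,
partition, random variable, and the two convex functions are hypothetical
choices made for illustration.

```python
from fractions import Fraction as Fr

# Hypothetical sample space, weights, partition, and random variable.
P = {0: Fr(1, 8), 1: Fr(3, 8), 2: Fr(1, 4), 3: Fr(1, 8), 4: Fr(1, 8)}
cells = [{0, 1}, {2, 3, 4}]
y = {0: Fr(-3), 1: Fr(2), 2: Fr(1, 2), 3: Fr(5), 4: Fr(-1)}

def cond_exp(g, cell):
    """Value of E{g(y) | F} on a partition cell."""
    weight = sum(P[w] for w in cell)
    return sum(P[w] * g(y[w]) for w in cell) / weight

def identity(t):
    return t

def square(t):
    """A continuous convex function on the whole line."""
    return t * t

for f in (square, abs):
    for cell in cells:
        left = f(cond_exp(identity, cell))    # f[E{y | F}]
        right = cond_exp(f, cell)             # E{f(y) | F}
        print(f.__name__, left <= right)
```

Each comparison is Jensen's inequality for the conditional distribution
attached to one cell, which is the finite counterpart of the argument above.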

This argument has been given in detail to illustrate the fact that, even 
though pathological counterexamples make it impossible to assume that 
conditional probabilities can always be used to define conditional proba- 
bility distributions, and that conditional expectations can then be 
evaluated as ordinary integrals, over $\omega$ space, nevertheless, for most
purposes, conditional probabilities and expectations can be manipulated 
as if these examples did not exist. 

It remains true, however, that some theorems on conditional probabili- 
ties and expectations can be derived just as easily directly as by the use of 
representation theory. This theory in such cases merely makes it obvious 
a priori that the theorems are true. The following theorem is an example 
of this possibility. 

Let $\mathscr{G}_0$ be a field of $\omega$ sets, and let $\mathscr{G}$ be the Borel
field generated by $\mathscr{G}_0$. Then two measures of $\mathscr{G}$ sets which
are identical for $\mathscr{G}_0$ sets are identical for $\mathscr{G}$ sets
(Supplement, Theorem 2.1), and therefore determine the same integrals of
functions. This fact is less obvious, but true, for conditional probabilities.
The formal statement is as follows.

THEOREM 9.6 Let $\mathscr{F}_1$, $\mathscr{F}_2$ be Borel fields of $\omega$ sets,
let $\mathscr{G}_0$ be a field of $\omega$ sets, and let $\mathscr{G}$ be the Borel
field generated by $\mathscr{G}_0$. Then, if, whenever $M \in \mathscr{G}_0$,

(9.9)   $$P\{M \mid \mathscr{F}_1\} = P\{M \mid \mathscr{F}_2\}$$

with probability 1, it follows that whenever $y$ is an $\omega$ function
measurable with respect to $\mathscr{G}$, or equal with probability 1 to such a
function, with $E\{|y|\} < \infty$, then

(9.10)   $$E\{y \mid \mathscr{F}_1\} = E\{y \mid \mathscr{F}_2\}$$

with probability 1.

The class of measurable sets $M$ for which (9.9) is true with probability 1
includes $\mathscr{G}_0$, and includes limits of monotone sequences of sets in the
class, by CP$_3$ of §8. Hence, according to the Supplement, Theorem 1.2,
the class includes $\mathscr{G}$. The class then must include sets differing from
$\mathscr{G}$ sets by sets of probability 0. Finally, (9.10) is now an immediate
consequence of the evaluation of a conditional expectation in terms of
conditional probabilities given in Theorem 8.4.




10. Iterated conditional expectations and probabilities 


The following identity is the particular case of (7.6) obtained by taking
$\Lambda = \Omega$:

(10.1)   $$E\{E\{y \mid \mathscr{F}\}\} = E\{y\}.$$

In particular, if $y(\omega)$ is 1 on the $\omega$ set $M$, and 0 otherwise, that
is, if $\Lambda = \Omega$ in (7.7), we obtain

(10.2)   $$E\{P\{M \mid \mathscr{F}\}\} = P\{M\}.$$

If $\mathscr{F}$ is the field of sets measurable on the sample space of a single
random variable $x$, the conditional expectation and probability are relative
to $x$. Then (10.1) states that the expected value of a random variable $y$ is
(roughly) the probability that $x(\omega)$ takes on a value, multiplied by the
expectation of $y$ for $x(\omega)$ given that value, summed over those values.
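For a discrete conditioning variable the verbal description of (10.1) can be
carried out literally. The following Python sketch does so for a small joint
distribution; the table `pmf` and the set $M = \{y \ge 3\}$ are hypothetical
values chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction as Fr

# Hypothetical joint distribution of (x, y): pmf[(x_value, y_value)].
pmf = {(0, 1): Fr(1, 8), (0, 3): Fr(3, 8), (1, 2): Fr(1, 4), (1, 6): Fr(1, 4)}
x_values = {a for a, _ in pmf}

def p_x(v):
    """P{x = v}."""
    return sum(p for (a, _), p in pmf.items() if a == v)

def e_y_given_x(v):
    """E{y | x = v}."""
    return sum(b * p for (a, b), p in pmf.items() if a == v) / p_x(v)

def p_M_given_x(v):
    """P{M | x = v} for the hypothetical set M = {y >= 3}."""
    return sum(p for (a, b), p in pmf.items() if a == v and b >= 3) / p_x(v)

# Left and right sides of (10.1).
total = sum(p_x(v) * e_y_given_x(v) for v in x_values)
e_y = sum(b * p for (_, b), p in pmf.items())
print(total == e_y)

# Left and right sides of (10.2).
p_M = sum(p for (_, b), p in pmf.items() if b >= 3)
print(sum(p_x(v) * p_M_given_x(v) for v in x_values) == p_M)
```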

It is instructive to analyze the meaning of conditional expectations and
probabilities when the basic distributions are themselves conditional.
We illustrate the situation by the following example. Suppose that all
probabilities are conditional, relative to $x$, and that there is actually a
conditional distribution, in the sense of the preceding section. It will be
convenient for the moment to denote the conditional probabilities and
expectations by $P_x\{\cdot\}$ and $E_x\{\cdot\}$ rather than by
$P\{\cdot \mid x\}$ and $E\{\cdot \mid x\}$. Now suppose that, for $x(\omega)$
fixed, one is interested in the conditional expectation of a random variable
$y$ or in the conditional probability of an $\omega$ set $M$ relative to a
random variable $z$, that is, in
$$E_x\{y \mid z\}, \qquad P_x\{M \mid z\}.$$
Since $y$ and $z$ are considered given random variables, with given (though
conditional) distributions, these doubly conditioned quantities need no
new definitions. It is easy to see, however, that this complicated notation
is unnecessary, and in fact that we can take

(10.3)   $$E_x\{y \mid z\} = E\{y \mid x, z\}, \qquad P_x\{M \mid z\} = P\{M \mid x, z\}.$$

Since we are not attempting to give a rigorous discussion here, we shall
not formalize this assertion, but shall verify the second equation in the
particular case when $\Omega$ is three-dimensional space, the measurable
$\omega$ sets are the Lebesgue measurable sets, the given probability measure
is determined by a density function $f$, and $x$, $y$, $z$ are the coordinate
functions of $\Omega$. Then the joint distribution of $x$, $y$, $z$ has density
$f$. Finally, we suppose that $M$ is an $\omega$ set of the form
$\{z(\omega) \in Z\}$, where $Z$ is a Lebesgue measurable set. In this case one
version of the bivariate conditional distribution of $y$, $z$ for
$x(\omega) = \xi$ has density
$$g(\eta, \zeta) = \frac{f(\xi, \eta, \zeta)}{\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(\xi, \eta, \zeta) \, d\eta \, d\zeta}.$$
Considering this density as defining a $y$, $z$ distribution, the conditional
distribution of $y$ for $z(\omega) = \zeta$ has density

(10.4)   $$\frac{g(\eta, \zeta)}{\int_{-\infty}^{\infty} g(\eta, \zeta) \, d\eta}.$$

On the other hand, the conditional distribution of $y$ for $x(\omega) = \xi$ and
$z(\omega) = \zeta$ has density

(10.5)   $$\frac{f(\xi, \eta, \zeta)}{\int_{-\infty}^{\infty} f(\xi, \eta, \zeta) \, d\eta}.$$

The equality of (10.4) and (10.5) is exactly the second equation in (10.3)
expressed in terms of densities.
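The cancellation that makes (10.4) equal to (10.5) does not depend on the
integrals being Lebesgue integrals, so it can be checked in a discrete analog
in which sums replace the integrals. The Python sketch below does this for a
joint probability table on a small grid; the table `f`, the grid `vals`, and
the function names are hypothetical and serve only as an illustration.

```python
from fractions import Fraction as Fr
from itertools import product

vals = (0, 1, 2)
# Hypothetical positive weights, normalized to a joint table f(xi, eta, zeta).
raw = {(a, b, c): 1 + a + 2 * b * c + (a * c) % 2
       for a, b, c in product(vals, repeat=3)}
norm = sum(raw.values())
f = {key: Fr(weight, norm) for key, weight in raw.items()}

def g(xi, eta, zeta):
    """Conditional table of (y, z) given x = xi."""
    return f[xi, eta, zeta] / sum(f[xi, b, c] for b in vals for c in vals)

def analog_10_4(xi, eta, zeta):
    """Condition the g-distribution of (y, z) on z = zeta."""
    return g(xi, eta, zeta) / sum(g(xi, b, zeta) for b in vals)

def analog_10_5(xi, eta, zeta):
    """Condition f directly on x = xi and z = zeta."""
    return f[xi, eta, zeta] / sum(f[xi, b, zeta] for b in vals)

print(all(analog_10_4(*k) == analog_10_5(*k) for k in f))
```

The normalizing factor of $g$ is the same in the numerator and denominator of
the first expression and therefore cancels, leaving the second expression.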

The relations between probabilities and expectations obey a certain
consistency law. Suppose that all expectations and probabilities are
conditional, relative to a conditioning variable $x_1$. Then the rules of
combination of
$$E_1\{\cdot\} = E\{\cdot \mid x_1\}, \qquad E_1\{\cdot \mid x_2\} = E\{\cdot \mid x_1, x_2\}$$
$$P_1\{\cdot\} = P\{\cdot \mid x_1\}, \qquad P_1\{\cdot \mid x_2\} = P\{\cdot \mid x_1, x_2\}$$
must be, respectively, the same as those of
$$E\{\cdot\}, \qquad E\{\cdot \mid x_2\},$$
$$P\{\cdot\}, \qquad P\{\cdot \mid x_2\}.$$
If this were not so, the knowledge that certain parameters in a problem
may really be random variables would entirely change the formal handling
of the problem.

This consistency principle suggests the following equations. Corresponding
to (10.1) and (10.2) we have

(10.6)   $$E\{E\{y \mid x_1, x_2\} \mid x_1\} = E\{y \mid x_1\}$$

and

(10.7)   $$E\{P\{M \mid x_1, x_2\} \mid x_1\} = P\{M \mid x_1\}$$

respectively, with probability 1. These equations reduce to (10.1) and
(10.2) if the conditioning variable $x_1$ is ignored. The equations are to
be interpreted as follows. If $x_1$, $x_2$, $y$ are random variables, with
$E\{|y|\} < \infty$, and if $M$ is a measurable $\omega$ set, then, no matter which
versions of the indicated conditional probabilities and expectations are
used, the equations are true with probability 1. This is another way of
saying that, if the proper versions are used, the equations are true for all
$\omega$. Equation (10.7) is a special case of (10.6). Both are almost immediate
consequences of the definitions of conditional probabilities and
expectations. Before proving them we generalize them as follows. Let
$\mathscr{F}_1$, $\mathscr{F}_2$ be Borel fields of measurable $\omega$ sets with
$\mathscr{F}_1 \subset \mathscr{F}_2$. Then

(10.8)   $$E\{E\{y \mid \mathscr{F}_2\} \mid \mathscr{F}_1\} = E\{y \mid \mathscr{F}_1\}$$

and

(10.9)   $$E\{P\{M \mid \mathscr{F}_2\} \mid \mathscr{F}_1\} = P\{M \mid \mathscr{F}_1\}$$

with probability 1. Evidently (10.6) and (10.7) are special cases of
(10.8) and (10.9), respectively. To prove (10.8), which contains them all,
we need only remark that (aside from the measurability conditions, which
are obviously satisfied) it states that
$$\int_{\Lambda} E\{y \mid \mathscr{F}_2\} \, dP = \int_{\Lambda} y \, dP, \qquad \Lambda \in \mathscr{F}_1,$$
and this equation is true, by definition of the integrand on the left, even
for $\Lambda \in \mathscr{F}_2 \supset \mathscr{F}_1$. There are many useful special
cases of (10.8) and (10.9). For example, if $y, x_1, x_2, \cdots$ are random
variables, with $E\{|y|\} < \infty$, then according to (10.8),
$$E\{E\{y \mid x_1, x_2, \cdots\} \mid x_{a_1}, x_{a_2}, \cdots\} = E\{y \mid x_{a_1}, x_{a_2}, \cdots\}$$
with probability 1.
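Equation (10.8) is easy to verify when both Borel fields are generated by
finite partitions, one refining the other. The Python sketch below averages a
random variable first over the finer partition and then over the coarser one,
and compares the result with a single averaging over the coarser partition.
The sample space, the weights, the two partitions, and the values of `y` are
hypothetical.

```python
from fractions import Fraction as Fr

# Hypothetical sample space with unequal weights summing to 1.
P = {w: Fr(w + 1, 36) for w in range(8)}
coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]    # generates the smaller field F_1
fine = [{0, 1}, {2, 3}, {4, 5, 6}, {7}]  # generates the larger field F_2
y = {w: Fr((w * w) % 5 - 1) for w in range(8)}

def cond_exp(values, cells):
    """E{values | field generated by cells}, as a function on the sample space."""
    out = {}
    for cell in cells:
        weight = sum(P[w] for w in cell)
        mean = sum(P[w] * values[w] for w in cell) / weight
        for w in cell:
            out[w] = mean
    return out

inner = cond_exp(y, fine)           # E{y | F_2}
left = cond_exp(inner, coarse)      # E{E{y | F_2} | F_1}
right = cond_exp(y, coarse)         # E{y | F_1}
print(left == right)
```

The order of conditioning matters here: the outer field must be the smaller
one, as in the hypothesis that the first field is contained in the second.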


11. Characteristic functions 

If $x$ is any random variable, with distribution function $F$, its
characteristic function is defined by

(11.1)   $$\Phi(t) = E\{e^{itx}\} = \int_{-\infty}^{\infty} e^{it\lambda} \, dF(\lambda).$$

The function $\Phi$ is uniquely determined by $F$ and is accordingly also
called the characteristic function of this distribution function. The
following fundamental properties of characteristic functions will be used.

(A) $\Phi$ is continuous for all $t$, and

(11.2)   $$|\Phi(t)| \le \Phi(0) = 1.$$


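If $E\{|x|^n\} < \infty$ for some positive integer $n$, $\Phi$ has $n$ continuous
derivatives, and

(11.3)   $$\Phi^{(n)}(t) = i^n \int_{-\infty}^{\infty} \lambda^n e^{it\lambda} \, dF(\lambda), \qquad |\Phi^{(n)}(t)| \le \int_{-\infty}^{\infty} |\lambda|^n \, dF(\lambda),$$

so that

(11.3′)
$$\Phi(t) = \sum_{j=0}^{n-1} \frac{i^j E\{x^j\}}{j!} t^j + \frac{t^n}{(n-1)!} \int_0^1 \Phi^{(n)}(ts)(1-s)^{n-1} \, ds$$
$$= \sum_{j=0}^{n} \frac{i^j E\{x^j\}}{j!} t^j + \frac{t^n}{(n-1)!} \int_0^1 [\Phi^{(n)}(ts) - \Phi^{(n)}(0)](1-s)^{n-1} \, ds$$
$$= \sum_{j=0}^{n-1} \frac{i^j E\{x^j\}}{j!} t^j + \theta \frac{E\{|x|^n\} |t|^n}{n!}, \qquad |\theta| \le 1,$$
$$= \sum_{j=0}^{n} \frac{i^j E\{x^j\}}{j!} t^j + o(|t|^n).$$

If $0 < \delta \le 1$ and if $E\{|x|^{n+\delta}\} < \infty$, we have

(11.4)   $$\Phi(t) = \sum_{j=0}^{n} \frac{i^j E\{x^j\}}{j!} t^j + O(|t|^{n+\delta}),$$
$$|O(|t|^{n+\delta})| \le \frac{2^{1-\delta} E\{|x|^{n+\delta}\} |t|^{n+\delta}}{(1+\delta)(2+\delta) \cdots (n+\delta)}.$$

To see this we need only find a majorant for the integral in the second
line of (11.3′), and in fact
$$\Big| \frac{t^n}{(n-1)!} \int_0^1 (1-s)^{n-1} \, ds \int_{-\infty}^{\infty} \lambda^n (e^{its\lambda} - 1) \, dF(\lambda) \Big|$$
$$\le \frac{|t|^n}{(n-1)!} \int_0^1 (1-s)^{n-1} \, ds \int_{-\infty}^{\infty} 2^{1-\delta} |\lambda|^n |ts\lambda|^{\delta} \, dF(\lambda)$$
$$= \frac{2^{1-\delta} E\{|x|^{n+\delta}\} |t|^{n+\delta}}{(1+\delta)(2+\delta) \cdots (n+\delta)}.$$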


(B) $F$ is determined by $\Phi$ in the sense that (Lévy's formula)

(11.5)   $$\frac{F(\lambda_2) + F(\lambda_2-)}{2} - \frac{F(\lambda_1) + F(\lambda_1-)}{2} = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-it\lambda_2} - e^{-it\lambda_1}}{-it} \Phi(t) \, dt.$$

(C) If $x_1, \cdots, x_n$ are mutually independent random variables, with
characteristic functions $\Phi_1, \cdots, \Phi_n$, the characteristic function
$\Phi$ of $x_1 + \cdots + x_n$ is
$$\Phi = \prod_{j=1}^{n} \Phi_j.$$

(D) If $F_1, F_2, \cdots$ is a sequence of distribution functions, with
$\lim_{n \to \infty} F_n(\lambda) = F(\lambda)$ at all continuity points of a
monotone function $F$, and if $F$ is a distribution function, the sequence of
characteristic functions of the $F_n$'s converges to the characteristic
function of $F$ uniformly in every finite interval.

We shall have occasion to use the related theorem that, if the functions
$F_n$, $F$ are only supposed monotone non-decreasing and bounded, with
$\lim_{n \to \infty} F_n(\lambda) = F(\lambda)$ at all continuity points of a
bounded monotone function $F$, and if $g$ is continuous and bounded, then

(11.6)   $$\lim_{n \to \infty} \int_{-\infty}^{\infty} g(\lambda) \, dF_n(\lambda) = \int_{-\infty}^{\infty} g(\lambda) \, dF(\lambda)$$

if
$$F(\infty) - F(-\infty) = \lim_{n \to \infty} [F_n(\infty) - F_n(-\infty)].$$
The latter condition is equivalent to the condition that
$$\lim_{\lambda \to \pm\infty} F_n(\lambda) = F_n(\pm\infty)$$
uniformly in $n$. If in addition $g(\lambda) = g(t, \lambda)$ depends on a
parameter $t$ and if $g(t, \lambda)$ is bounded and continuous in $(t, \lambda)$,
then the convergence in (11.6) is uniform for $t$ in every finite interval. The
theorem on characteristic functions stated in the preceding paragraph is a
particular case of this theorem.

(E) Conversely, if $F_1, F_2, \cdots$ is a sequence of distribution functions
whose corresponding sequence of characteristic functions
$\Phi_1, \Phi_2, \cdots$ converges to a characteristic function $\Phi$, and if
$F$ is the distribution function corresponding to $\Phi$, then
$\lim_{n \to \infty} F_n(\lambda) = F(\lambda)$ at every continuity point of $F$.
If it is only supposed that $\lim_{n \to \infty} \Phi_n(t)$ exists for all $t$, the
limit function will necessarily be a characteristic function if the limit is
uniform in some interval containing $t = 0$.

The following consequence of (D) and (E) will be used frequently. If
$x_1, x_2, \cdots$ is a sequence of random variables, with characteristic
functions $\Phi_1, \Phi_2, \cdots$, then $\operatorname{p\,lim}_{n \to \infty} x_n = 0$
if and only if $\lim_{n \to \infty} \Phi_n(t) = 1$ uniformly in every finite $t$
interval. In fact, if $F(\lambda)$ is defined as 0 for $\lambda < 0$ and
1 for $\lambda \ge 0$, $\operatorname{p\,lim}_{n \to \infty} x_n = 0$ if and only if

(11.7)   $$\lim_{n \to \infty} P\{x_n(\omega) \le \lambda\} = F(\lambda), \qquad \lambda \ne 0,$$

and the condition for (11.7) in terms of characteristic functions given by
(D) and (E) is precisely the stated one. As a matter of fact, if
$\lim_{n \to \infty} \Phi_n(t) = 1$ uniformly in some interval containing $t = 0$,
the same will be true in every finite interval, because of the following
inequality in the real part of a characteristic function $\Phi$:
$$\Re[1 - \Phi(t)] = \int_{-\infty}^{\infty} [1 - \cos t\lambda] \, dF(\lambda) \ge \int_{-\infty}^{\infty} \frac{\sin^2 t\lambda}{2} \, dF(\lambda) = \tfrac{1}{4} \Re[1 - \Phi(2t)].$$
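The inequality between the real parts at $t$ and $2t$ rests on the elementary
bound $1 - \cos\theta \ge \tfrac{1}{2}\sin^2\theta$ and the identity
$\sin^2\theta = \tfrac{1}{2}(1 - \cos 2\theta)$. The Python sketch below checks
the resulting inequality on a grid of $t$ values for a discrete distribution;
the distribution `dist` and the grid are hypothetical.

```python
import math

# Hypothetical discrete distribution {value: probability}.
dist = {-2.5: 0.1, -0.4: 0.3, 0.0: 0.2, 1.0: 0.25, 3.7: 0.15}

def re_one_minus_phi(t):
    """Real part of 1 - Phi(t), that is, the integral of 1 - cos(t*lambda)."""
    return sum(p * (1.0 - math.cos(t * v)) for v, p in dist.items())

# Smallest value of R[1 - Phi(t)] - R[1 - Phi(2t)]/4 over a grid of t values.
worst = min(re_one_minus_phi(0.05 * k) - 0.25 * re_one_minus_phi(0.1 * k)
            for k in range(-200, 201))
print(worst >= -1e-12)
```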


The following inequalities will be used frequently. They illustrate the
simple way in which the characteristic function of a distribution limits
the probabilities of large values, and show the simplicity gained by
centering the distribution at a median value or at a truncated expectation.
Throughout the following discussion, $x$ is a random variable with
distribution function $F$ and characteristic function $\Phi$; $a$, $\alpha$, $u$
are positive numbers, and $A$ is a Lebesgue measurable subset of the $t$
interval $[0, a]$, of measure $p > 0$. The function $L_j$ is a positive function
of whatever variables are indicated, but not depending, for example, on the
distribution function $F$ except by way of the indicated variables. Let $m$ be
a median value of $x$,
$$P\{x(\omega) \le m\} \ge \tfrac{1}{2}, \qquad P\{x(\omega) \ge m\} \ge \tfrac{1}{2},$$
define $x'$ as $x$ truncated to $m$ when it is beyond $m \pm \alpha$,
$$x'(\omega) = \begin{cases} x(\omega), & |x(\omega) - m| \le \alpha, \\ m, & |x(\omega) - m| > \alpha, \end{cases}$$
and define
$$\tilde{m} = E\{x'\}, \qquad \sigma^2 = E\{(x' - \tilde{m})^2\}, \qquad \tilde{\Phi}(t) = E\{e^{it(x - \tilde{m})}\}.$$
Then we shall prove that there is an $L_1(u, p, a)$ such that

(11.8)   $$P\{|x(\omega)| \ge u\} \le L_1(u, p, a) \int_A \Re[1 - \Phi(t)] \, dt.$$

Centering at $m$ will yield the more useful

(11.8′)   $$P\{|x(\omega) - m| \ge u\} \le 2L_1(u, p, a) \int_A [1 - |\Phi(t)|^2] \, dt \le -4L_1(u, p, a) \int_A \log |\Phi(t)| \, dt.$$

Centering at $\tilde{m}$ will yield an inequality of the same type,

(11.8″)   $$P\{|x(\omega) - \tilde{m}| \ge u\} \le L_4(\alpha, u, p, a) \int_A [1 - |\Phi(t)|^2] \, dt \le -2L_4(\alpha, u, p, a) \int_A \log |\Phi(t)| \, dt.$$

Proceeding to study the second moments, we shall prove

(11.9)   $$\int_{-u}^{u} \lambda^2 \, dF(\lambda) \le L_2(u, p, a) \int_A \Re[1 - \Phi(t)] \, dt.$$

If we denote the variance of the $F$ distribution truncated to $m$ when it is
beyond $m \pm u$ by $\sigma(u)^2$,
$$\sigma(u)^2 = \int_{m-u-}^{m+u} (\lambda - m)^2 \, dF(\lambda) - \Big( \int_{m-u-}^{m+u} (\lambda - m) \, dF(\lambda) \Big)^2,$$
then $\sigma(\alpha)^2 = \sigma^2$, and (11.9) will yield, for some
$L_3(u, p, a)$,

(11.9′)   $$\sigma(u)^2 \le L_3(u, p, a) \int_A [1 - |\Phi(t)|^2] \, dt \le -2L_3(u, p, a) \int_A \log |\Phi(t)| \, dt.$$

Finally, we shall prove that, for some $L_5(T, \alpha, p, a)$,

(11.10)   $$|1 - \tilde{\Phi}(t)| \le L_5(T, \alpha, p, a) \int_A [1 - |\Phi(t)|^2] \, dt \le -2L_5(T, \alpha, p, a) \int_A \log |\Phi(t)| \, dt, \qquad |t| \le T.$$

The following elementary inequalities will be needed in the proofs of
the above inequalities. In the first place the function
$$s - \sin s - \frac{s^3}{\pi^2}$$
vanishes at $s = 0, \pi$, is positive for small $s$, and its derivative vanishes
at a single point between 0 and $\pi$. Hence

(11.11)   $$s - \sin s \ge \frac{s^3}{\pi^2}, \qquad 0 \le s \le \pi.$$

In the second place, if $\lambda > 0$,

(11.12)   $$\int_A (1 - \cos \lambda t) \, dt \ge \frac{p^3 \lambda^2}{(a\lambda + 2\pi)^2}.$$

To prove this let $a_1$ be the smallest number $\ge a$ for which
$\lambda a_1 / 2\pi$ is an integer. Then
$$a_1 < a + \frac{2\pi}{\lambda}$$
and $A \subset [0, a_1]$. We minimize the integral in (11.12) by replacing $A$ by
$\lambda a_1 / \pi$ non-overlapping intervals in $[0, a_1]$, each of length
$\pi p / a_1 \lambda$, each with an endpoint at a point
$t = 0, 2\pi/\lambda, \cdots, a_1$ which makes the integrand 0:
$$\int_A (1 - \cos \lambda t) \, dt \ge \frac{\lambda a_1}{\pi} \int_0^{\pi p / a_1 \lambda} (1 - \cos \lambda t) \, dt = p - \frac{a_1}{\pi} \sin(\pi p / a_1).$$
Then by (11.11)
$$\int_A (1 - \cos \lambda t) \, dt \ge \frac{p^3}{a_1^2} \ge \frac{p^3}{(a + 2\pi/\lambda)^2},$$
proving (11.12).
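Both elementary inequalities can be checked numerically. In the Python sketch
below the set $A$ is taken to be a finite union of disjoint intervals, so that
the integral in (11.12) has a closed form; the particular intervals, the value
of $a$, and the grid of $\lambda$ values are hypothetical choices.

```python
import math

def lower_bound(p, a, lam):
    """Right side of (11.12): p^3 lam^2 / (a lam + 2 pi)^2."""
    return p ** 3 * lam ** 2 / (a * lam + 2 * math.pi) ** 2

def integral(intervals, lam):
    """Exact integral of 1 - cos(lam t) over a union of disjoint intervals."""
    return sum((d - c) - (math.sin(lam * d) - math.sin(lam * c)) / lam
               for c, d in intervals)

# Hypothetical set A inside [0, a], written as disjoint intervals.
a = 3.0
A = [(0.0, 0.3), (1.0, 1.4), (2.5, 3.0)]
p = sum(d - c for c, d in A)   # Lebesgue measure of A

for lam in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0):
    print(lam, integral(A, lam) >= lower_bound(p, a, lam))

# Spot check of (11.11) at a few points of [0, pi].
for s in (0.5, 1.5, 3.0):
    print(s, s - math.sin(s) >= s ** 3 / math.pi ** 2)
```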

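To prove (11.8) integrate the inequality
$$\int_{|\lambda| \ge u} (1 - \cos \lambda t) \, dF(\lambda) \le \Re[1 - \Phi(t)]$$
over $A$ to obtain
$$\int_{|\lambda| \ge u} dF(\lambda) \int_A (1 - \cos \lambda t) \, dt \le \int_A \Re[1 - \Phi(t)] \, dt.$$
This inequality implies (11.8) with
$$L_1(u, p, a) = \frac{(au + 2\pi)^2}{p^3 u^2}$$
in view of (11.12).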
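A numerical check of (11.8) is straightforward for a distribution concentrated
on finitely many points. The Python sketch below makes the simplifying
assumption that $A$ is the whole interval $[0, a]$, so that $p = a$, and uses
the constant $(au + 2\pi)^2 / p^3 u^2$ that results from applying (11.12) with
$|\lambda| \ge u$; the distribution `dist` and the parameter values are
hypothetical.

```python
import math

def tail_prob(dist, u):
    """P{|x| >= u} for a discrete distribution {value: probability}."""
    return sum(p for v, p in dist.items() if abs(v) >= u)

def tail_bound(dist, u, a):
    """Right side of (11.8), assuming A = [0, a] so that p = a."""
    L1 = (a * u + 2 * math.pi) ** 2 / (a ** 3 * u ** 2)
    # Integral over [0, a] of R[1 - Phi(t)]; the atom at 0 contributes nothing.
    integral = sum(p * (a - math.sin(a * v) / v)
                   for v, p in dist.items() if v != 0)
    return L1 * integral

# Hypothetical distribution.
dist = {-2.0: 0.1, 0.0: 0.7, 3.0: 0.2}
for u, a in ((1.5, 1.0), (2.5, 4.0), (0.5, 10.0)):
    print(u, a, tail_prob(dist, u) <= tail_bound(dist, u, a))
```

The bound is often far from sharp, and may exceed 1; its value lies in the
fact that the constant does not depend on the distribution.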
To prove (11.8′) apply (11.8) to $x - x^*$, where $x^*$ is independent of $x$
and has the same distribution as $x$. We obtain
$$P\{|x(\omega) - x^*(\omega)| \ge u\} \le L_1(u, p, a) \int_A [1 - |\Phi(t)|^2] \, dt,$$
since $x - x^*$ has the characteristic function $|\Phi|^2$. Now
$$P\{|x(\omega) - x^*(\omega)| \ge u\} \ge P\{x(\omega) - m \ge u, \; x^*(\omega) - m \le 0\} + P\{x(\omega) - m \le -u, \; x^*(\omega) - m \ge 0\}$$
$$\ge \tfrac{1}{2} P\{x(\omega) - m \ge u\} + \tfrac{1}{2} P\{x(\omega) - m \le -u\}$$
$$= \tfrac{1}{2} P\{|x(\omega) - m| \ge u\},$$
and this inequality combined with the preceding one yields the first half
of (11.8′). The other half follows from the inequality
$$1 - \alpha \le -\log \alpha, \qquad 0 < \alpha \le 1.$$

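We defer the proof of (11.8'') until that of (11.10) has been given. To
prove (11.9) we use (11.12) to obtain

          $\int_A \Re[1 - \Phi(t)] \, dt \ge \int_{-u}^{u} dF(\lambda) \int_A (1 - \cos \lambda t) \, dt$
                 $\ge \int_{-u}^{u} \frac{p^3 \lambda^2}{(a|\lambda| + 2\pi)^2} \, dF(\lambda)$
                 $\ge \frac{p^3}{(au + 2\pi)^2} \int_{-u}^{u} \lambda^2 \, dF(\lambda),$

so that (11.9) is true with

          $L_2(u, p, a) = \frac{(au + 2\pi)^2}{p^3}.$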


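To prove (11.9'), (11.9) with 2u instead of u is applied to $x - x^*$ to
obtain

          $\int_{\{|x(\omega) - x^*(\omega)| \le 2u\}} (x - x^*)^2 \, dP \le L_2(2u, p, a) \int_A [1 - |\Phi(t)|^2] \, dt.$

If we combine this inequality with

          $\int_{\{|x(\omega) - x^*(\omega)| \le 2u\}} (x - x^*)^2 \, dP \ge \int_{\{|x(\omega) - m| \le u, \ |x^*(\omega) - m| \le u\}} (x - x^*)^2 \, dP$

                 $= 2 P\{|x(\omega) - m| \le u\} \int_{m-u-}^{m+u} (\lambda - m)^2 \, dF(\lambda) - 2 \Bigl( \int_{m-u-}^{m+u} (\lambda - m) \, dF(\lambda) \Bigr)^2$

                 $\ge 2 \sigma(u)^2 - 2u^2 P\{|x(\omega) - m| > u\},$

we obtain

          $\sigma(u)^2 \le u^2 P\{|x(\omega) - m| > u\} + \tfrac{1}{2} L_2(2u, p, a) \int_A [1 - |\Phi(t)|^2] \, dt.$

This implies (11.9') with

          $L_3(u, p, a) = 2u^2 L_1(u, p, a) + \tfrac{1}{2} L_2(2u, p, a)$

                 $= \frac{2(au + 2\pi)^2}{p^3} + \frac{(2au + 2\pi)^2}{2p^3}$

                 $\le \frac{4(au + 2\pi)^2}{p^3}.$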
To prove (11.10) we note that (using the definition of 7m) 


m+n 
i| (0-0 1) AFA) + 2Pf|a(w) — m| > a} 


m-a- 


a113) |I— &(9| < 


ii [(e*-™ — 1 — i(A— m)] dF(A) | 


m-a- 


+ 2P{|a(w)— m| > a} 


< 


+ |t(ra— m)|P{|x(w) — m| > a} 
m+n 
<$ | o- maro 
+ [2 + |i(— m)|P{|e(w)— m| > a} 
A 
= Ë 2+ h m| E h m 
“P{je(@)— m| > a} 
< a 4 ` Pilato) — m| > a}. 
Hence (11.10) is true with 


T? 
LT, æ, p, a) = z7 L,(a, p, a) + 5L,(%, p, a) 


2 (a+ roa + 5) 


Finally, we prove (11.8) by applying (11.8) to x— m, obtaining 
P{la(o) — ñ| > u} < Lu, p,a) | RI DO] de 
A 


< Lilu, p,a) | |1— (| at, 
A 
so that (1 1.8) is true, with 


L(x, 4, p, a) = aL,(u, p, a)L;(a, a, p, a) 
2. 2 2n\2 
<ap* (a : =) (a+ a) (2a? + 5). 


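If x is a bounded random variable, (11.9') yields a majorant of the
variance of x, if u is sufficiently large, but it is more enlightening to make
a direct analysis. We shall prove that if $|x| \le M$, and if x has variance
$\sigma^2$, and characteristic function $\Phi$, then

(11.14)   $-\log |\Phi(t)| \le \sigma^2 t^2 \le -3 \log |\Phi(t)|, \qquad |t| \le \frac{1}{4M}.$

To prove this, suppose first that E{x} = 0. Then by (11.4) with n = 2,

(11.15)   $|1 - \Phi(t)| \le \frac{\sigma^2 t^2}{2},$

and by (11.4') with n = 2, $\delta = 1$,

(11.16)   $\Bigl| 1 - \Phi(t) - \frac{\sigma^2 t^2}{2} \Bigr| \le \frac{M \sigma^2 |t|^3}{6}.$

Now, if z is a complex number, and if $|1 - z| \le 1/2$,

          $|\log z + 1 - z| = \Bigl| \int_1^z \Bigl( \frac{1}{\zeta} - 1 \Bigr) d\zeta \Bigr| \le |1 - z|^2$

(integrating along a line segment). Hence, using (11.15),

          $|\log \Phi(t) + 1 - \Phi(t)| \le |1 - \Phi(t)|^2 \le \frac{\sigma^4 t^4}{4} \le \frac{M^2 \sigma^2 t^4}{4}, \qquad M|t| \le 1/2,$

so that, combining this inequality with (11.16),

          $\Bigl| \log \Phi(t) + \frac{\sigma^2 t^2}{2} \Bigr| \le \frac{M^2 \sigma^2 t^4}{4} + \frac{M \sigma^2 |t|^3}{6} \le \frac{\sigma^2 t^2}{6}, \qquad M|t| \le 1/2.$

Taking real parts, we find that (11.14) is true for $|t| \le 1/2M$. If we now
drop the restriction that E{x} = 0, we apply the inequality to x - E{x},
which is bounded by 2M, to obtain (11.14) as stated.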
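
The inequality (11.14) is easy to test on discrete bounded distributions. The sketch below assumes the reading -log|Phi(t)| <= sigma^2 t^2 <= -3 log|Phi(t)| for |t| <= 1/(4M); the helper names and the sample distributions are illustrative and are not taken from the text.

```python
import cmath
import math

def char_fn(values, probs, t):
    """Characteristic function E{exp(i t x)} of a discrete random variable."""
    return sum(p * cmath.exp(1j * t * v) for v, p in zip(values, probs))

def variance(values, probs):
    """Variance of the discrete distribution."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

def satisfies_11_14(values, probs, n=200, tol=1e-12):
    """Check both halves of (11.14) on a grid of |t| <= 1/(4M)."""
    M = max(abs(v) for v in values)      # bound on |x|; must be positive
    var = variance(values, probs)
    for k in range(-n, n + 1):
        t = k / (n * 4 * M)
        neg_log = -math.log(abs(char_fn(values, probs, t)))
        if neg_log > var * t * t + tol:
            return False
        if var * t * t > 3 * neg_log + tol:
            return False
    return True
```

For the symmetric variable taking the values -1 and 1 with probability 1/2 each, the characteristic function is cos t and the check reduces to -log cos t <= t^2 <= -3 log cos t for |t| <= 1/4.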


CHAPTER II 


Definition of 
a Stochastic Process— 
Principal Classes 


1. Definition of a stochastic process 


From the non-mathematical point of view a stochastic process is any 
probability process, that is, any process running along in time and con- 
trolled by probabilistic laws. Numerical observations made as the 
process continues indicate its evolution. With this background to guide 
us we define a stochastic process as any family of random variables $\{x_t, t \in T\}$.
Here $x_t$ is in practice the observation at time t, and T is the time range
involved. 

Most classical problems in probability involve only finitely many 
random variables, that is, the corresponding t range T is a finite set. In 
the following chapters, however, we shall almost always consider processes 
involving infinitely many random variables, and the term stochastic 
process has usually been applied only in this case. The two most 
important cases are the following: 

(a) T is an infinite sequence; $\{x_t, t \in T\}$ becomes

          $x_1, x_2, \cdots$
or
          $\cdots, x_{-1}, x_0$
or
          $\cdots, x_0, x_1, \cdots.$


This type of process is called a discrete parameter process. 

(b) T is an interval; $\{x_t, t \in T\}$ becomes a continuous parameter family,
and the process is called a continuous parameter process. In this case 
the general sample is a function of t defined on an interval, whereas in 
the discrete parameter case the general sample is a sequence. More 
generally we shall call any process whose parameter set is finite or 
enumerable a discrete parameter process, and any process with a non- 
denumerable parameter set a continuous parameter process. 



Our definition of a stochastic process is historically conditioned and 
has obvious defects. In the first place there is no mathematical reason 
for restricting T to be a set of real numbers, and in fact interesting work 
has already been done in other cases. (Of course, the interpretation of 
t as time must then be dropped.) In the second place there is no
mathematical reason for restricting the value assumed by the $x_t$'s to be numbers.
However, in this book we shall consider only processes with T a linear 
point set, and the random variables will almost always be real- or 
complex-valued. We allow the defining real random variables of
processes to have the values $+\infty$ and $-\infty$, but only with zero
probability.

Our definition of a stochastic process is embarrassingly inclusive in 
that there are not many problems in probability that cannot be formulated 
as problems in families of random variables. Historically, however, the 
term stochastic process has been reserved for families (usually infinite) of 
random variables with some simple relationship between the variables, 
and this book is devoted to the most important examples of such families. 
A basic problem is to devise suitable relationships, that is, to discover 
new types of stochastic processes which are useful, or mathematically 
elegant, or which conform otherwise to the investigator’s criterion of 
importance. 

Let $\{x_t, t \in T\}$ be a stochastic process. A function of $t \in T$ obtained
by fixing $\omega$ in $x_t(\omega)$ and letting t vary is called a sample function of the
process. (If T is finite or denumerably infinite, the sample functions are
of course sample sequences.)


Let $t_1, \cdots, t_n$ be any finite set of parameter values of the process
$\{x_t, t \in T\}$. The multivariate distribution of the random variables
$x_{t_1}, \cdots, x_{t_n}$ is called a finite dimensional distribution of the process. The
finite dimensional distributions of a process are the basic distributions
of the processes we shall consider in this book, and the processes
we shall consider are classified according to their finite dimensional
distributions.
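
The notions of sample function and finite dimensional distribution can be made concrete on a finite sample space. The following is a minimal sketch under assumptions of our own choosing: the process is a three-step symmetric random walk, the sample points are the sequences of steps, and all names (`OMEGA`, `PROB`, `x`, and so on) are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

N_STEPS = 3
OMEGA = list(product((-1, 1), repeat=N_STEPS))       # sample space: all step sequences
PROB = {w: Fraction(1, len(OMEGA)) for w in OMEGA}    # uniform probability measure

def x(t, w):
    """The random variable x_t at the sample point w: position after t steps."""
    return sum(w[:t])

def sample_function(w):
    """Fix w and let t vary over T = {0, ..., N_STEPS}."""
    return [x(t, w) for t in range(N_STEPS + 1)]

def finite_dimensional_distribution(times):
    """Joint distribution of (x_{t_1}, ..., x_{t_n}) as {value tuple: probability}."""
    dist = Counter()
    for w, p in PROB.items():
        dist[tuple(x(t, w) for t in times)] += p
    return dict(dist)
```

Here `sample_function((1, -1, 1))` is the sequence 0, 1, 0, 1, and `finite_dimensional_distribution((1, 2))` assigns probability 1/4 to each of the four possible pairs of positions after one and two steps.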

Let $\mathcal{F}_T = \mathcal{B}(x_t, t \in T)$ be the Borel field of $\omega$ sets generated by the
class of sets of the form $\{x_t(\omega) \in A\}$, where $t \in T$, and A is any Borel set
(one-dimensional if the $x_t$'s are real, two-dimensional if they are complex).
Then $\mathcal{F}_T$ is the smallest Borel field of $\omega$ sets with respect to which all
the $x_t$'s are measurable. The Borel sets A in the definition can be
supposed in a somewhat more restricted class, without changing $\mathcal{F}_T$.
For example, the A's can be taken as right semi-closed intervals.

Let $\mathcal{F}_0$ be the field of $\omega$ sets of the form

          $\{[x_{t_1}(\omega), \cdots, x_{t_n}(\omega)] \in A\}$

where $(t_1, \cdots, t_n)$ is any finite parameter set and A is a right semi-closed
interval (n-dimensional if the $x_t$'s are real, 2n-dimensional in the complex
case). Then $\mathcal{F}_T$ is the Borel field generated by $\mathcal{F}_0$. According to
Theorem 2.3 of the Supplement, if $\Lambda$ is an $\omega$ set which is a measurable set
on the sample space of the $x_t$'s (see I §7 for the definition of this concept),
and if $\varepsilon > 0$, there is an $\mathcal{F}_0$ set $\Lambda_\varepsilon$ such that

          $P\{\Lambda(\Omega - \Lambda_\varepsilon) \cup (\Omega - \Lambda)\Lambda_\varepsilon\} < \varepsilon.$

According to the same theorem, if x is a random variable measurable on
the sample space of the $x_t$'s, and if $\varepsilon > 0$, there is an $\omega$ function $x_\varepsilon$ which
takes on only finitely many values, each on an $\mathcal{F}_0$ set, such that

          $P\{|x(\omega) - x_\varepsilon(\omega)| > \varepsilon\} < \varepsilon.$


The given probability measure is also a measure of $\mathcal{F}_T$ sets. Let $\mathcal{F}_T'$
be the domain of definition of this restricted measure after completion
(see Supplement, §2), that is, $\mathcal{F}_T'$ consists of the $\mathcal{F}_T$ sets and also of the
sets which differ from $\mathcal{F}_T$ sets by sets which are subsets of $\mathcal{F}_T$ sets of
probability 0. All the $\mathcal{F}_T'$ sets need not be measurable if the given
probability measure is not complete, but in any event there is only one
way in which their probabilities can be defined which is compatible with
the finite dimensional distributions. In fact the finite dimensional
distributions furnish the probabilities of $\mathcal{F}_0$ sets, the probability measure
of $\mathcal{F}_0$ sets can be extended in only one way to $\mathcal{F}_T$ sets (Supplement,
Theorem 2.1), and the probability measure of $\mathcal{F}_T$ sets can be completed
by extension to $\mathcal{F}_T'$ sets in only one way.

If $\omega$ sets not in $\mathcal{F}_T'$ are measurable, that is, have probabilities assigned
to them, these probabilities are to a certain extent incidental. Some
additional principle is necessary if these probabilities are to be determined
in terms of the basic probabilities of $\mathcal{F}_0$ sets, that is, in terms of the
finite dimensional distributions. Before investigating this question
systematically, we illustrate it by an almost trivial example. Let $\Omega$ be
the semi-closed interval (0, 1]. The measurable sets are to be any unions
of the six semi-closed intervals $((j-1)/6, \, j/6]$, $j = 1, \cdots, 6$. The given
probability measure is to be length. The stochastic process is to contain
only a single random variable x, defined as j on the jth semi-closed interval
above. This random variable provides a mathematical model for the
result of tossing a balanced die once. The fields $\mathcal{F}_T$ and $\mathcal{F}_0$ are the
same and consist of all the measurable sets. It is obvious that the
measures of other sets can be defined in many ways which are compatible
with the measures already assigned. For example, the set containing the
single point ¢ is not now measurable. It can be assigned the probability




4, which means that the rest of the interval (}, 1] must be assigned the 
probability 0. On the other hand, an entirely different assignment of 
measures compatible with the given ones would be the assignment, to 
every Lebesgue measurable subset of (0, 1], of its Lebesgue measure. 
With this assignment, the set containing the single point + is measurable, 
but has probability 0. The distribution of the random variable x is
unaffected by these new assignments of probabilities. 
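
The die example can be sketched in exact arithmetic. The code below compares two extensions of the given measure: Lebesgue measure, and a measure that concentrates the whole mass of one of the six intervals at a single point. The location of that point (`POINT`, placed in the last interval) is a hypothetical choice made purely for illustration, as are the function names.

```python
from fractions import Fraction as F

CELLS = [(F(j - 1, 6), F(j, 6)) for j in range(1, 7)]   # the intervals ((j-1)/6, j/6]

def lebesgue(c, d):
    """Length of the semi-closed interval (c, d]."""
    return max(d - c, F(0))

POINT = F(11, 12)   # hypothetical location of the atom, inside the last interval

def atom_extension(c, d):
    """Measure of (c, d]: length outside the last interval; inside it, mass 1/6 at POINT."""
    last_lo, _ = CELLS[-1]
    outside = lebesgue(c, min(d, last_lo))
    atom = F(1, 6) if c < POINT <= d else F(0)
    return outside + atom

def distribution_of_x(measure):
    """Probabilities P{x = j}, j = 1, ..., 6, under the given measure."""
    return {j: measure(lo, hi) for j, (lo, hi) in enumerate(CELLS, start=1)}
```

Both measures give each face of the die probability 1/6, although they disagree on small intervals around `POINT`; this is the sense in which probabilities of sets outside the completed field are incidental.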

We now investigate these measure questions systematically, using the 
representations of families of random variables discussed in §6 of I. 
Suppose that $\{x_t, t \in T\}$ is a stochastic process. Assume that the process
is real; the obvious modifications are made in the complex case. Let
$\tilde\omega$ be a real function of $t \in T$; the function values $\pm\infty$ are admissible.
Let $\tilde\Omega$ be the space of all points $\tilde\omega$, that is, of all functions of $t \in T$. Then
$\tilde\Omega$ is a coordinate space whose dimensionality is the cardinal number of T.
Let $\tilde{x}_s(\tilde\omega)$ be the sth coordinate of $\tilde\omega$, that is, the value of the function $\tilde\omega$ of
t when t = s. Then $\tilde{x}_t$ is a representation of $x_t$ based on function space.

If T is finite, containing say the n points $t_1, \cdots, t_n$, $\tilde\Omega$ becomes the
n-dimensional space of points

          $[\tilde\omega(t_1), \cdots, \tilde\omega(t_n)],$

and a measure is defined on the Borel sets of this space by assigning to
the point set

(1.1)     $\{[\tilde{x}_{t_1}(\tilde\omega), \cdots, \tilde{x}_{t_n}(\tilde\omega)] \in A\} = \{[\tilde\omega(t_1), \cdots, \tilde\omega(t_n)] \in A\}$

the probability

(1.2)     $P\{[x_{t_1}(\omega), \cdots, x_{t_n}(\omega)] \in A\},$

for every n-dimensional Borel set A. Consistently with our previous
notation we denote by $\tilde{\mathcal{F}}_T$ the field of n-dimensional Borel sets.
Completing the measure of Borel sets obtained in this way, we obtain an
n-dimensional Lebesgue-Stieltjes measure.

If T is infinite, $\tilde\Omega$ becomes infinite dimensional Cartesian space. Let
$\tilde{\mathcal{F}}_T = \mathcal{B}(\tilde{x}_t, t \in T)$ be the Borel field generated by the class of $\tilde\omega$ sets of
the form (1.1). Then a measure of $\tilde{\mathcal{F}}_T$ sets is obtained by assigning the
probability (1.2) to the "finite dimensional" $\tilde\omega$ set (1.1), and more generally
every $\tilde{\mathcal{F}}_T$ set $\tilde\Lambda$ is assigned as probability the probability of the $\omega$ set $\Lambda$
which determines sample functions in $\tilde\Lambda$, that is, the set $\Lambda$ such that if
$\omega \in \Lambda$, then $x_t(\omega)$ defines a function of t which is an element of $\tilde\Lambda$. (See
Supplement, Example 3.2, and I §6.) For each $t \in T$ the $\tilde\omega$ function $\tilde{x}_t$
is now a random variable, and for every finite t set $t_1, \cdots, t_n$ the $\tilde\omega$
random variables $\tilde{x}_{t_1}, \cdots, \tilde{x}_{t_n}$ have the same multivariate distribution as
the $\omega$ random variables $x_{t_1}, \cdots, x_{t_n}$. The family $\{\tilde{x}_t, t \in T\}$ was called
a representation of the family $\{x_t, t \in T\}$ in I §6. Every family of random
variables thus has a representation whose basic space is function space.
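
For a finite parameter set the function-space representation can be carried out explicitly. The sketch below is only an illustration under assumed data: two fair coin tosses, with x_t the number of heads among the first t tosses. Each sample point is mapped to its sample function, the induced measure is accumulated on function space, and the coordinate variables are compared with the original ones. All names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

T = (0, 1, 2)
OMEGA = list(product((0, 1), repeat=2))       # two fair coin tosses
P = {w: Fraction(1, 4) for w in OMEGA}

def x(t, w):
    """Original process: number of heads among the first t tosses."""
    return sum(w[:t])

def to_function_space(w):
    """The point of function space determined by w: its sample function on T."""
    return tuple(x(t, w) for t in T)

# Induced measure: the mass of a function is the probability of the sample
# points whose sample function it is.
P_TILDE = defaultdict(Fraction)
for w, p in P.items():
    P_TILDE[to_function_space(w)] += p

def x_tilde(s, f):
    """Coordinate variable: the value at s of the function f (stored as a tuple over T)."""
    return f[T.index(s)]

def joint(times, var, measure):
    """Joint distribution of (var(t_1, .), ..., var(t_n, .)) under the measure."""
    out = defaultdict(Fraction)
    for point, p in measure.items():
        out[tuple(var(t, point) for t in times)] += p
    return dict(out)
```

The two calls `joint((1, 2), x, P)` and `joint((1, 2), x_tilde, P_TILDE)` return the same dictionary, which is the statement that the coordinate variables have the same finite dimensional distributions as the original family.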

On the other hand we have seen in I §5 that, if T is any aggregate and
$\tilde\Omega$ the space of all functions of $t \in T$, it is possible to define a probability
measure on the Borel field $\tilde{\mathcal{F}}_T$ by assigning (finite dimensional) measures
to $\tilde\omega$ sets of the form (1.1) in a consistent way, and then (if T is infinite)
extending the definition to the remaining sets of $\tilde{\mathcal{F}}_T$. Thus an $\tilde\omega$ measure
of $\tilde{\mathcal{F}}_T$ sets can be defined either as the one induced by a given family of
random variables, or as one induced by a given (consistent) family of
finite dimensional probability distributions. In either case we finally
obtain a family of random variables $\{\tilde{x}_t, t \in T\}$.

The advantage of dealing with the family $\{\tilde{x}_t, t \in T\}$ is that we have
complete control over the class of measurable sets. We have already
remarked that for a general family of random variables $\{x_t, t \in T\}$
probabilities of sets not in $\mathcal{F}_T'$ may be defined, but if defined these probabilities
are to a certain extent arbitrary, not solely determined by the probabilities
of $\mathcal{F}_T$ sets and the properties of measure in general. For certain purposes
to be discussed in §2, it is necessary to define probabilities of sets not in
$\mathcal{F}_T'$, and these must be defined in a specific way to obtain a fruitful
theory. Since they may already be defined in some other way, the theory
runs into real difficulties here. The two best ways of overcoming the
difficulty seem to be the following: (a) one can modify the random
variables of a given family in such a way that the finite dimensional
distributions are unaltered, and that the sets whose measurability is
desired turn out to be in $\mathcal{F}_T'$; (b) one can base the theory on the space
$\tilde\Omega$ and field $\tilde{\mathcal{F}}_T'$ of measurable sets, defining probabilities of sets not in
$\tilde{\mathcal{F}}_T'$ in any way desired that is compatible with the general properties of
measure. The first method will be adopted in this book in most of the
discussion. Both possibilities will be discussed in §2.


2. The scope of probability measure 


Let $\{x_n, n \ge 1\}$ be a stochastic process. The arithmetical operations
performed on the $x_n$'s and those involved in finding bounds and limits of
$x_n$'s always lead to functions of $\omega$ which are random variables, that is,
which are measurable. For example, $\operatorname*{L.U.B.}_n |x_n|$ is a random variable,
because the set equality

(2.1)     $\{\operatorname*{L.U.B.}_{n} |x_n(\omega)| > \lambda\} = \bigcup_{n=1}^{\infty} \{|x_n(\omega)| > \lambda\}$

shows that the set on the left is measurable, as the union of a sequence
of measurable sets. (The usual conventions are made if the least upper
bound is infinite.)




The situation is more complicated, however, if one deals with a
non-denumerable family, say $\{x_t, t \in I\}$, where I is an interval. In this case
(2.1) becomes

(2.1')    $\{\operatorname*{L.U.B.}_{t \in I} |x_t(\omega)| > \lambda\} = \bigcup_{t \in I} \{|x_t(\omega)| > \lambda\}.$

Since the union on the right involves non-denumerably many sets, the 
equality does not imply the measurability of the set on the left, and in 
fact it is not measurable in general. Continuing this investigation, we 
find that the probabilities that the sample functions are bounded, are 
continuous, are measurable, are integrable, and so on, may not be defined 
because the corresponding $\omega$ sets may not be measurable. What is worse
is that even if these sets are measurable their measures are not uniquely 
determined by the finite dimensional distributions, and in fact for given 
finite dimensional distributions there may be considerable latitude in the 
measures of these sets. This means that the choice that happens to have 
been made in the original definition of $\omega$ measure may not be the one
which leads to a fruitful theory. 

As an example consider the following very simple case. The parameter
set T is the infinite line, $-\infty < t < \infty$. The finite dimensional
distributions are determined by

          $P\{x_t(\omega) = 0\} = 1, \qquad -\infty < t < \infty.$

Then, if S is any finite or enumerably infinite set (and only in that case),
it follows that

          $P\{x_t(\omega) = 0, \ t \in S\} = 1.$

It is desirable for many purposes to assert that

          $P\{x_t(\omega) = 0, \ -\infty < t < \infty\} = 1.$


However, if M is the set determined by the condition in the brace, 
examples will be given at the end of this section in which M is not measur- 
able, in which M is measurable and P{M} = 1, in which it is measurable 
and P{M} = 0. The middle case is of course the desirable one, but the 
others cannot be excluded without introducing some new criterion. The 
criterion of separability introduced in the next paragraph appears to be 
the simplest appropriate criterion. 
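
A standard illustration of the difficulty, which is not necessarily one of the examples given later in the section, takes the sample space to be the unit interval with Lebesgue measure and compares the process x_t(w) = 1 if t = w, 0 otherwise, with the process that vanishes identically. Both satisfy P{x_t = 0} = 1 for every t, yet no sample function of the first vanishes identically. The sketch below simulates this; the seed, the number of sample points, and the finite set `S` standing in for an enumerable parameter set are arbitrary choices.

```python
import random

def x(t, w):
    """First process: a spike on the diagonal t == w, zero elsewhere."""
    return 1 if t == w else 0

def y(t, w):
    """Second process: identically zero."""
    return 0

def vanishes_on(process, w, times):
    """True if the sample function of the process at w is 0 at every listed time."""
    return all(process(t, w) == 0 for t in times)

rng = random.Random(0)
omegas = [rng.random() for _ in range(10_000)]     # sample points drawn from [0, 1)
S = [k / 100 for k in range(101)]                  # finite stand-in for an enumerable set

frac_x_on_S = sum(vanishes_on(x, w, S) for w in omegas) / len(omegas)
frac_y_on_S = sum(vanishes_on(y, w, S) for w in omegas) / len(omegas)

# Over the whole parameter range the time t = w is always available,
# so no sample function of the first process vanishes everywhere.
x_vanishes_everywhere = [vanishes_on(x, w, S + [w]) for w in omegas]
```

On the set `S` the two processes are indistinguishable, while over the full parameter range the first process never vanishes identically; the second process is the one for which the desirable assertion holds.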

Let $\{x_t, t \in T\}$ be a real stochastic process with linear parameter set T.
Let $\mathcal{A}$ be a system of linear Borel sets. The process will be called
separable relative to $\mathcal{A}$ if there is a sequence $\{t_j\}$ of parameter values and
an $\omega$ set $\Lambda$ of probability 0 such that, if $A \in \mathcal{A}$, and if I is any open
interval, the $\omega$ sets

          $\{x_t(\omega) \in A, \ t \in IT\}, \qquad \{x_{t_j}(\omega) \in A, \ t_j \in IT\}$

differ by at most a subset of $\Lambda$. The second of these two $\omega$ sets is
obviously a measurable $\omega$ set which contains the first. The first set is
then necessarily also measurable, under the separability hypothesis. The
first set is not necessarily measurable in general unless IT is at most
denumerable. The two most important special cases are:

(i) $\mathcal{A}$ is the class $\mathcal{A}_0$ of all (finite or infinite) closed intervals. This is
the smallest class $\mathcal{A}$ for which the concept of separability is useful, and
we shall accordingly write separable instead of separable relative to $\mathcal{A}_0$.

(ii) $\mathcal{A}$ is the class of all closed sets.

Obviously separability relative to a class of sets implies separability
relative to any smaller class. For the purposes of this book we shall
need only separability relative to the class $\mathcal{A}_0$ of closed intervals, but for
some questions separability relative to a larger class is required. Examples
will be given below. In the following we shall call a sequence $\{t_j\}$ a
sequence satisfying the conditions of the separability definition (relative to
a specified class $\mathcal{A}$, or relative to $\mathcal{A}_0$ if none is specified), if there is an
$\omega$ set $\Lambda$ which in conjunction with the sequence has the properties stated
in the separability definition.

It is obvious how the concept of separability is extended to
abstract-valued random variables. In particular, for complex-valued random
variables, the above definition need be changed only by making the sets
of the class $\mathcal{A}$ two-dimensional Borel sets. We shall say that the process
is separable if the processes determined by the real and imaginary parts
of the process random variables are separable (relative to $\mathcal{A}_0$). In other
words, in the complex case the minimal class corresponding to $\mathcal{A}_0$ in the
real case is the class of closed rectangles with sides parallel to the
coordinate axes.

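Going back to the case of real valued random variables: according to
the definition of separability relative to the class of closed intervals, if $\Lambda$
has the properties stated in the definition, and if $\omega \notin \Lambda$, then

(2.2)     $\operatorname*{G.L.B.}_{t \in IT} x_t(\omega) = \operatorname*{G.L.B.}_{t_j \in IT} x_{t_j}(\omega),$
          $\operatorname*{L.U.B.}_{t \in IT} x_t(\omega) = \operatorname*{L.U.B.}_{t_j \in IT} x_{t_j}(\omega),$

for every open interval I. Conversely, if there is an $\omega$ set $\Lambda$ with
$P\{\Lambda\} = 0$, such that if $\omega \notin \Lambda$ it follows that (2.2) is true for every open
interval I, then the $x_t$ process is obviously separable. Note that for all $\omega$

          $\operatorname*{G.L.B.}_{t \in IT} x_t(\omega) \le \operatorname*{G.L.B.}_{t_j \in IT} x_{t_j}(\omega),$
          $\operatorname*{L.U.B.}_{t \in IT} x_t(\omega) \ge \operatorname*{L.U.B.}_{t_j \in IT} x_{t_j}(\omega),$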




so that if (2.2) is true it will remain true if more values of t are added to
the sequence $\{t_j\}$. Hence the content of the statement is that for almost
all $\omega$ (2.2) is true for a sufficiently large denumerable set $\{t_j\}$. We can
evidently replace (2.2) by

(2.2')    $\lim_{n \to \infty} \operatorname*{G.L.B.}_{|t_j - t| < 1/n} x_{t_j}(\omega) \le x_t(\omega) \le \lim_{n \to \infty} \operatorname*{L.U.B.}_{|t_j - t| < 1/n} x_{t_j}(\omega), \qquad t \in T.$

Separability implies that if I is an open interval

          $\operatorname*{L.U.B.}_{t \in IT} x_t(\omega), \quad \operatorname*{G.L.B.}_{t \in IT} x_t(\omega), \quad \limsup_{t \to \tau} x_t(\omega), \quad \liminf_{t \to \tau} x_t(\omega)$

are all (finite- or infinite-valued) random variables, that is, measurable
functions, since the right sides of (2.2) are random variables. In connection
with an example discussed above, we remark that if for each
parameter value t

          $P\{x_t(\omega) = 0\} = 1,$

then, if $\{t_j\}$ is any sequence of parameter values,

          $P\{x_{t_j}(\omega) = 0, \ j \ge 1\} = 1.$

In particular, if the process is separable, and if $\{t_j\}$ is a sequence satisfying
the conditions of the separability definition, the preceding equation
implies that

          $P\{x_t(\omega) = 0, \ t \in T\} = 1.$

We have already remarked that this conclusion is not correct without
some hypothesis like separability.

Before discussing the existence of separable processes, we prove three 
theorems for later reference. 

THEOREM 2.1 Let $\{x_t, t \in T\}$ be a real stochastic process.

(i) If there is a sequence $\{t_j\}$ of parameter values for which (2.2) is true
unless $\omega \in \Lambda(I)$, where $P\{\Lambda(I)\} = 0$, for every I, then the $x_t$ process is
separable, and in fact the sequence $\{t_j\}$ satisfies the conditions of the
separability definition.

(ii) If the $x_t$ process is separable, and if there is a sequence $\{t_j\}$ of parameter
values for which (2.2') is true unless $\omega \in \Lambda_t$, where $P\{\Lambda_t\} = 0$ for every
$t \in T$, then the sequence $\{t_j\}$ satisfies the conditions of the separability
definition.

To prove (i) define $M = \bigcup_j \Lambda(I_j)$, where the $I_j$'s are the intervals with
rational endpoints. Then $P\{M\} = 0$ and (2.2) is true if $\omega \notin M$ for every
open interval I. The sequence $\{t_j\}$ thus satisfies the conditions of the
separability definition. To prove (ii) let $\{\tau_j\}$ be a sequence of parameter
values satisfying the conditions of the separability definition, so that
(2.2) is true with the $\tau_j$'s unless $\omega \in N$, where $P\{N\} = 0$. Then, unless

          $\omega \in N \cup \bigcup_j \Lambda_{\tau_j},$

          $\operatorname*{G.L.B.}_{t_j \in IT} x_{t_j}(\omega) \le \operatorname*{G.L.B.}_{\tau_j \in IT} x_{\tau_j}(\omega) = \operatorname*{G.L.B.}_{t \in IT} x_t(\omega),$
          $\operatorname*{L.U.B.}_{t_j \in IT} x_{t_j}(\omega) \ge \operatorname*{L.U.B.}_{\tau_j \in IT} x_{\tau_j}(\omega) = \operatorname*{L.U.B.}_{t \in IT} x_t(\omega).$

The inequalities must be equalities since the third term in the first (second)
line is certainly not larger (smaller) than the first term. Hence the
sequence $\{t_j\}$ satisfies the conditions of the separability definition.

THEOREM 2.2 Let $\{x_t, t \in T\}$ be a separable stochastic process, and
suppose that, for every $\tau \in T$,

          $\operatorname*{p\,lim}_{t \to \tau} x_t = x_\tau.$

(i) If $\{t_j\}$ is any sequence of parameter values dense in T, this sequence
satisfies the conditions of the separability definition.

(ii) Let I: [a, b] be any finite closed interval containing points of T, and
suppose that, for each n, $a \le s_1^{(n)} < \cdots < s_{a_n}^{(n)} \le b$, with $s_j^{(n)} \in T$,

          $\lim_{n \to \infty} \operatorname*{L.U.B.}_{t \in IT} \operatorname*{Min}_{j \le a_n} |t - s_j^{(n)}| = 0,$

(that is, the $s_j^{(n)}$'s become dense in [a, b]T when $n \to \infty$). Then it follows
that, if the process is real,

          $\lim_{n \to \infty} \operatorname*{Min}_{j} x_{s_j^{(n)}}(\omega) = \operatorname*{G.L.B.}_{t \in IT} x_t(\omega),$
          $\lim_{n \to \infty} \operatorname*{Max}_{j} x_{s_j^{(n)}}(\omega) = \operatorname*{L.U.B.}_{t \in IT} x_t(\omega),$

with probability 1.

The truth of (i) is an obvious consequence of Theorem 2.1 (ii). We
prove the second of the two limit relations in (ii) in the real case. Let
$\{t_j\}$ be a sequence of parameter values satisfying the conditions of the
separability definition. We can suppose that, if a or b is in T, the endpoint
is also a $t_j$.

Fix the parameter value $t \in [a, b]$, and for each n choose j in such a
way that $s_n = s_j^{(n)} \to t$ $(n \to \infty)$. Then by hypothesis

          $x_t = \operatorname*{p\,lim}_{n \to \infty} x_{s_n}.$

Hence there is probability 1 convergence for some subsequence of values
of n. This fact implies the truth with probability 1 of the far weaker
inequality

          $x_t(\omega) \le \liminf_{n \to \infty} \operatorname*{Max}_{k \le a_n} x_{s_k^{(n)}}(\omega).$

But then, letting t run through the points of any sequence $\{t_j\}$ satisfying
the conditions of the separability definition,

          $\operatorname*{L.U.B.}_{t \in IT} x_t(\omega) = \operatorname*{L.U.B.}_{t_j \in IT} x_{t_j}(\omega) \le \liminf_{n \to \infty} \operatorname*{Max}_{k \le a_n} x_{s_k^{(n)}}(\omega)$

with probability 1. Since the reverse inequality is obvious, we have now
derived the desired result.
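
The second limit relation of Theorem 2.2 (ii) can be watched on a single sample function. The sketch below does not simulate a random process: it fixes one hypothetical sample point and a sample function continuous in t, and computes its maximum over the nested dyadic partitions of [0, 1], which play the part of the points $s_j^{(n)}$. The particular function and the value of `OMEGA` are assumptions made for illustration.

```python
import math

OMEGA = 0.123        # one fixed sample point (hypothetical)

def x(t, w=OMEGA):
    """A sample function continuous in t; its least upper bound over [0, 1] is 1."""
    return math.sin(2 * math.pi * (t + w))

def grid_max(k, w=OMEGA):
    """Maximum of the sample function over the dyadic points j / 2**k, 0 <= j <= 2**k."""
    n = 2 ** k
    return max(x(j / n, w) for j in range(n + 1))

# Maxima over successively finer partitions; the partitions are nested,
# so the sequence cannot decrease.
maxima = [grid_max(k) for k in range(1, 11)]
```

The maxima increase toward the least upper bound 1 as the partition is refined, which is the behavior the theorem asserts with probability 1 for separable processes continuous in probability.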

In the following we shall frequently find it convenient to simplify the
typography by using the notation $x(t, \omega)$ and $x(t)$ instead of $x_t(\omega)$ and $x_t$.

THEOREM 2.3 Let $\{x(t), t \in T\}$ be a real separable stochastic process,
and suppose that $\tau$ is a limit point of the t set $T\{t > \tau\}$. There is then a
sequence $\{\tau_n\}$ in T such that

          $\tau_1 > \tau_2 > \cdots, \qquad \tau_n \to \tau,$

          $\limsup_{n \to \infty} x(\tau_n) = \limsup_{t \downarrow \tau} x(t),$
          $\liminf_{n \to \infty} x(\tau_n) = \liminf_{t \downarrow \tau} x(t),$

with probability 1.

This theorem implies for example that, if $\lim_{n \to \infty} x(s_n)$ exists with
probability 1 whenever $s_n \downarrow \tau$, then, even if the exceptional $\omega$ set may
depend on the sequence $\{s_n\}$ in the hypotheses, nevertheless $\lim_{t \downarrow \tau} x(t)$
exists with probability 1, that is, almost all sample functions have limits
when $t \downarrow \tau$. In proving the theorem we shall suppose that the superior
and inferior limits involved are finite; if this is not true initially, it will
be true if $x(t)$ is replaced by $\arctan x(t)$. Now choose $t_1, t_2, \cdots$ to
satisfy the conditions of the separability definition, so that (2.2) is satisfied
with probability 1, for all open intervals I. Then for each n choose a
finite number of $t_j$'s, say $s_1^{(n)}, s_2^{(n)}, \cdots$, to satisfy

          $\tau < s_j^{(n)} < \tau + \frac{1}{n},$

          $P\Bigl\{ \operatorname*{L.U.B.}_{\tau < t < \tau + 1/n} x(t, \omega) - \operatorname*{Max}_{j} x(s_j^{(n)}, \omega) \ge \frac{1}{n} \Bigr\} \le \frac{1}{n},$

          $P\Bigl\{ \operatorname*{G.L.B.}_{\tau < t < \tau + 1/n} x(t, \omega) - \operatorname*{Min}_{j} x(s_j^{(n)}, \omega) \le -\frac{1}{n} \Bigr\} \le \frac{1}{n}.$

If $\{\tau_n\}$ is defined as the $s_j^{(n)}$'s arranged into a monotone sequence, $\tau_n \downarrow \tau$ and

          $\operatorname*{p\,lim}_{n \to \infty} \Bigl[ \operatorname*{L.U.B.}_{\tau < t < \tau + 1/n} x(t) - \operatorname*{L.U.B.}_{\tau_j < \tau + 1/n} x(\tau_j) \Bigr] = 0,$

          $\operatorname*{p\,lim}_{n \to \infty} \Bigl[ \operatorname*{G.L.B.}_{\tau < t < \tau + 1/n} x(t) - \operatorname*{G.L.B.}_{\tau_j < \tau + 1/n} x(\tau_j) \Bigr] = 0.$

These limit equations imply the truth of the theorem, since the least upper
and greatest lower bounds involved converge to the corresponding
superior and inferior limits.

The preceding two theorems show the fundamental importance of the
concept of separability of a process. It has not yet been explained how
much of a restriction it is on a process $\{x_t, t \in T\}$ to suppose that it is
separable. Obviously separability is no restriction at all if the parameter
set is denumerable, because in that case the parameter set is itself a
sequence $\{t_j\}$ which satisfies the conditions of the separability definition
(relative to every class $\mathcal{A}$). The following lemma is fundamental. Note
that the proof does not use our standard assumption that the parameter
space T is linear. The following discussion can thus be generalized to
abstract parameter sets as well as to abstract-valued random variables.

LEMMA 2.1 Let $\{x_t, t \in T\}$ be a stochastic process. To each linear Borel
set A there corresponds a finite or enumerable sequence $\{t_n\}$ such that

(2.3)     $P\{x_{t_n}(\omega) \in A, \ n \ge 1; \ x_t(\omega) \notin A\} = 0, \qquad t \in T.$

More generally, let $\mathcal{A}_0$ be an at most enumerable class of linear Borel sets,
and let $\mathcal{A}$ be the class of sets which are intersections of sequences of $\mathcal{A}_0$
sets. Then there is a finite or enumerable sequence $\{t_n\}$ such that to each
$t \in T$ corresponds an $\omega$ set $\Lambda_t$ with $P\{\Lambda_t\} = 0$ and

(2.4)     $\{x_{t_n}(\omega) \in A, \ n \ge 1; \ x_t(\omega) \notin A\} \subset \Lambda_t, \qquad A \in \mathcal{A}.$

We prove first that the truth of the first part of the lemma implies that
of the second apparently more general part. In fact, if the first part is
true, to each $A \in \mathcal{A}_0$ there corresponds a certain parameter sequence for
which (2.3) is true, and then (2.3) is true for all $A \in \mathcal{A}_0$ if $\{t_n\}$ is the union
of all these parameter sequences. Moreover, with this definition of $\{t_n\}$,
if the $\omega$ set in (2.3) is denoted by $\Lambda_t(A)$, define

          $\Lambda_t = \bigcup_{A \in \mathcal{A}_0} \Lambda_t(A).$

Then, if $A \in \mathcal{A}$, if $A_0 \in \mathcal{A}_0$, and if $A \subset A_0$,

          $\{x_{t_n}(\omega) \in A, \ n \ge 1; \ x_t(\omega) \notin A_0\} \subset \{x_{t_n}(\omega) \in A_0, \ n \ge 1; \ x_t(\omega) \notin A_0\} \subset \Lambda_t,$

and the truth of (2.4) follows from the hypothesis that A is the intersection
of a sequence of $\mathcal{A}_0$ sets. To prove the truth of the first statement of the
lemma, let $t_1$ be any point of T. If $t_1, \cdots, t_k$ have already been chosen,
define

          $p_k = \operatorname*{L.U.B.}_{t \in T} P\{x_{t_n}(\omega) \in A, \ n \le k; \ x_t(\omega) \notin A\}.$

Then $p_1 \ge p_2 \ge \cdots$. If $p_k = 0$, then $t_1, \cdots, t_k$ is the desired sequence.
If $p_k > 0$, choose $t_{k+1}$ as any value of t such that the probability on the
right exceeds $p_k(1 - 1/k)$. Then, if $p_k > 0$ for all k,

          $P\{x_{t_n}(\omega) \in A, \ n \ge 1; \ x_t(\omega) \notin A\} \le \lim_{k \to \infty} p_k, \qquad t \in T.$

Since the $\omega$ sets

          $\{x_{t_n}(\omega) \in A, \ n \le k; \ x_{t_{k+1}}(\omega) \notin A\}, \qquad k \ge 1,$

are disjunct, their probabilities form a convergent series, so that

          $\lim_{k \to \infty} p_k = \lim_{k \to \infty} p_k \Bigl( 1 - \frac{1}{k} \Bigr) \le \lim_{k \to \infty} P\{x_{t_n}(\omega) \in A, \ n \le k; \ x_{t_{k+1}}(\omega) \notin A\} = 0.$

This equality, combined with the preceding inequality, implies the truth
of the first part of the lemma.
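
The selection procedure in the proof of the first part of the lemma can be run on a finite toy model. In the sketch below the sample space and the parameter set are both finite, so the least upper bound $p_k$ is attained and the procedure stops after finitely many steps; choosing a maximizing t is one admissible way of exceeding $p_k(1 - 1/k)$. The toy process, the set `A`, and all names are assumptions made for illustration.

```python
from fractions import Fraction

def greedy_sequence(T, omega_probs, x, A):
    """Return t_1, ..., t_k such that, for every t in T,
    P{x(t_n) in A for all n <= k, and x(t) not in A} = 0."""
    chosen = [T[0]]                                   # t_1 is arbitrary
    while True:
        # Sample points at which every chosen variable takes a value in A.
        inside = [w for w in omega_probs if all(x(s, w) in A for s in chosen)]

        def escape_prob(t):
            """P{x(t_n) in A, n <= k; x(t) not in A} for the current choice."""
            return sum(omega_probs[w] for w in inside if x(t, w) not in A)

        best = max(T, key=escape_prob)
        p_k = escape_prob(best)
        if p_k == 0:
            return chosen
        chosen.append(best)       # attains p_k, hence exceeds p_k * (1 - 1/k)

# Toy model: three equally likely sample points, six parameter values.
T = list(range(6))
OMEGA_PROBS = {w: Fraction(1, 3) for w in range(3)}

def x(t, w):
    """x_t(w) = 1 if t is congruent to w modulo 3, and 0 otherwise."""
    return 1 if t % 3 == w else 0

SEQ = greedy_sequence(T, OMEGA_PROBS, x, {0})
```

In this model the procedure selects the parameter values 0, 1, 2, after which no sample point remains at which all selected variables lie in `A`, so (2.3) holds trivially for every t.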

The following theorem shows that separability relative to the class of
closed sets is no restriction on the finite dimensional distributions of a
stochastic process $\{x_t, t \in T\}$, that is, on the joint distributions of finite
aggregates of the $x_t$'s. In the language of §1 this means that separability
relative to the class of closed sets is no restriction on the probabilities of
sets in the field $\mathcal{F}_0$. In other words, the condition is a restriction only
on a probability involving non-denumerably many $x_t$'s. This result is
the best that could be hoped for.

THEOREM 2.4 Let $\{x_t, t \in T\}$ be a stochastic process with linear parameter
set T. There is then a stochastic process $\{\tilde{x}_t, t \in T\}$, defined on the same
$\omega$ space, separable relative to the class of closed sets, with the property that

(2.5)     $P\{\tilde{x}_t(\omega) = x_t(\omega)\} = 1, \qquad t \in T.$

(The $\tilde{x}_t$'s may take on the values $\pm\infty$.)

Note that the joint distribution of finitely many of the $\tilde{x}_t$'s is exactly
the same as that of the corresponding $x_t$'s. The $\omega$ set $\{\tilde{x}_t(\omega) \ne x_t(\omega)\}$ is
of probability 0 for each t, but this $\omega$ set may vary with t. If the union
of all these $\omega$ sets, as t varies, has probability 0, the $x_t$ process is itself
separable relative to the closed sets.

We give the proof of Theorem 2.4 in the real case. The trivial changes
to cover the complex case will be obvious. Let $\mathcal{A}_0$ be the class of linear
sets which are finite unions of open or closed intervals with rational or
infinite endpoints, and let $\mathcal{A}$ be the class of sets which are intersections
of sequences of $\mathcal{A}_0$ sets. Then $\mathcal{A}$ includes the closed sets. Let I be an
open interval with rational or infinite endpoints. We apply the preceding
lemma to the stochastic process $\{x_t, t \in IT\}$, with $\mathcal{A}_0$, $\mathcal{A}$ as just defined.




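According to the lemma, there is an at most enumerable set $T(I) \subset IT$,
and an $\omega$ set $\Lambda_t(I)$, such that

          $P\{\Lambda_t(I)\} = 0, \qquad t \in IT,$
          $\{x_s(\omega) \in A, \ s \in T(I); \ x_t(\omega) \notin A\} \subset \Lambda_t(I), \qquad A \in \mathcal{A}.$

Define

          $S = \bigcup_I T(I), \qquad \Lambda_t = \bigcup_I \Lambda_t(I).$

Let $A(I, \omega)$ be the closure of the set of values of $x_s(\omega)$ for fixed $\omega$ and s
varying in IS. The set $A(I, \omega)$ may include the values $\pm\infty$. It is closed,
non-empty, and

          $x_t(\omega) \in A(I, \omega)$ if $t \in IT$, $\omega \notin \Lambda_t$.

Hence, if the set $A(t, \omega)$ is defined by

          $A(t, \omega) = \bigcap_{I \ni t} A(I, \omega),$

it follows that this set is closed, non-empty, and

          $x_t(\omega) \in A(t, \omega)$ if $t \in T$, $\omega \notin \Lambda_t$.

For each t, $\omega$ we now define $\tilde{x}_t(\omega)$ as follows:

          $\tilde{x}_t(\omega) = x_t(\omega), \qquad t \in S,$
          $\tilde{x}_t(\omega) = x_t(\omega), \qquad t \notin S, \ \omega \notin \Lambda_t,$

and define $\tilde{x}_t(\omega)$ as any value in $A(t, \omega)$ if $t \notin S$ and $\omega \in \Lambda_t$. With this
definition we proceed to show that the $\tilde{x}_t$ process satisfies the conditions
of the theorem. The condition (2.5) is obviously satisfied. Let A be a
closed set. Suppose that I is an open interval with rational or infinite
endpoints, and that $\omega$ has the property that

          $\tilde{x}_s(\omega) \in A, \qquad s \in SI,$

that is, $A(I, \omega) \subset A$. It follows from our definition of $\tilde{x}_t(\omega)$ that if
$t \in IT$ then

          $\tilde{x}_t(\omega) = x_t(\omega) \in A(I, \omega)$ if $t \in S$ or if $t \notin S$, $\omega \notin \Lambda_t$,
          $\tilde{x}_t(\omega) \in A(t, \omega) \subset A(I, \omega) \subset A$ if $t \notin S$, $\omega \in \Lambda_t$.

Thus

          $\{\tilde{x}_s(\omega) \in A, \ s \in IS\} = \{\tilde{x}_t(\omega) \in A, \ t \in IT\},$

if A is closed and if I has rational or infinite endpoints. If $I'$ is any open
interval, it can be expressed in the form

          $I' = \bigcup_n I_n$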




where $I_n$ has rational or infinite endpoints. Since the above set equality
is true for $I = I_n$, it is also true (taking the intersection in n) for $I = I'$.
This completes the proof of the theorem. We observe that we cannot
exclude infinite values for $\tilde{x}_t(\omega)$, since the set $A(t, \omega)$ above may contain
no finite values. In general, if there is a Borel set X to which the values
of each $x_t$ are restricted with probability 1, more precisely if

          $P\{x_t(\omega) \in X\} = 1, \qquad t \in T,$

it is possible to define $\tilde{x}_t(\omega)$ to take on values only in the closure of X
on the infinite line (here the infinite line is supposed made compact by
the adjunction of the points $\pm\infty$). For example, if X is a finite closed
interval, this means that the $\tilde{x}_t$'s can be supposed to have values only in
this same interval, and are therefore necessarily finite-valued. If X is
the set of positive integers, the range of values of each $\tilde{x}_t$ will be the set
of positive integers, and also the point $\infty$, unless the latter value can be
excluded by further information on the actual distributions involved. In
any event, of course, for each t, $\tilde{x}_t$ is finite-valued with probability 1.

We conclude our discussion of separability with a few remarks on 
Theorems 2.1 and 2.2. In those theorems we considered only separability 
relative to the class ./, of closed intervals, and a few remarks on the 
extension of the theorems to separability relative to the class of closed 
sets are now in order. We omit the details since we shall not use the 
results in this book. Let $I$ be an open interval, let $\{x_t, t \in T\}$ be a
stochastic process, and let $S$ be an at most enumerable subset of $T$. Let
$A(I, \omega)$ be the closure of the set of values of $x_s(\omega)$ for $s \in IS$, let
$A(t, \omega) = \bigcap_{I \ni t} A(I, \omega)$, and let $A'(I, \omega)$, $A'(t, \omega)$ be defined in the same
way except that in the definition of $A'(I, \omega)$ $s$ is not restricted to lie in $S$.
By definition, $S$ satisfies the conditions of the separability definition
relative to the class of closed sets if there is an $\omega$ set $\Lambda$ with $P\{\Lambda\} = 0$,
such that for every open interval $I$ and closed set $A$ the sets
$$\{x_t(\omega) \in A,\ t \in IS\}, \qquad \{x_t(\omega) \in A,\ t \in IT\}$$
differ by a subset of $\Lambda$, that is, if
$$A(I, \omega) = A'(I, \omega), \qquad \omega \notin \Lambda.$$


Moreover [cf. Theorem 2.1 (i)] it is even sufficient if the condition is
(apparently) weakened by allowing $\Lambda$ to depend on $I$. If the $x_t$ process
is known to be separable relative to the class of closed sets, and if
$$P\{x_t(\omega) \in A(t, \omega)\} = 1, \qquad t \in T,$$
then [cf. Theorem 2.1 (ii)] the set $S$ satisfies the conditions of the
separability definition relative to the class of closed sets. This fact
implies that Theorem 2.2 (i) is true for separability relative to the class
of closed sets.

Let $\{x_t, t \in T\}$ be a stochastic process. In order to take full advantage
of the apparatus of measure theory, it is necessary to suppose for some
purposes that $x_t(\omega)$ defines a measurable function of the pair of variables
$(t, \omega)$. Here $t$ measure is taken as Lebesgue measure in $T$, $\omega$ measure
as the given probability measure, and $(t, \omega)$ measure as the usual product
of the two measures, supposed independent. The choice of Lebesgue
measure on the $t$ axis, rather than some other extension of a measure of
Borel sets, is made because in the applications to be made one is usually
interested in ordinary Lebesgue integrals of sample functions and in
properties of stochastic processes invariant under translations of the
parameter axis. We therefore make the following definition: the stochastic
process $\{x_t, t \in T\}$ is measurable if the parameter set $T$ is Lebesgue
measurable and if $x_t(\omega)$ defines a function measurable in the pair of variables
$(t, \omega)$.

THEOREM 2.5 Let $\{x_t, t \in T\}$ be a separable process, with a Lebesgue
measurable parameter set $T$. Suppose that there is a $t$ set $T_1$ of Lebesgue
measure 0 such that
$$P\{\lim_{s \to t} x_s(\omega) = x_t(\omega)\} = 1, \qquad t \in T - T_1.$$
Then the $x_t$ process is measurable.

Define $U_t^{(n)}(\omega)$ by
$$U_t^{(n)}(\omega) = \operatorname*{L.U.B.}_{s \in T,\ j/n \le s < (j+1)/n} x_s(\omega), \qquad \frac{j}{n} \le t < \frac{j+1}{n}, \quad j = 0, \pm 1, \cdots.$$
Then $\{U_t^{(n)}, t \in T\}$ is a family of random variables, and as we have defined
this family $U_t^{(n)}$ is a single random variable, the above L.U.B., in the
part of each $t$ interval $[j/n, (j+1)/n)$ lying in $T$. Then $U_t^{(n)}(\omega)$ is $(t, \omega)$
measurable. We define $L_t^{(n)}(\omega)$ in the same way except that L.U.B. is
replaced by G.L.B. Then
$$L_t^{(n)}(\omega) \le x_t(\omega) \le U_t^{(n)}(\omega).$$
According to the hypotheses of the theorem, for each $t \in T - T_1$ the
extreme terms of this inequality converge to the middle term with
probability 1, when $n \to \infty$. Since these extreme terms are $(t, \omega)$
measurable, it follows (Fubini's theorem) that the extreme terms have a
common $(t, \omega)$ measurable limit for almost all $(t, \omega)$. Since this common
limit must be $x_t(\omega)$, it follows that $x_t(\omega)$ must define a $(t, \omega)$ measurable
function, as was to be proved.
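
As a rough numerical illustration of the sandwich used in the proof of Theorem 2.5, the following sketch evaluates the upper and lower step functions for one fixed continuous sample function on $T = [0, 1)$. It is not part of the original argument: the sample function, the function names, and the finite grid used to approximate the L.U.B. and G.L.B. on each interval $[j/n, (j+1)/n)$ are all assumptions made for illustration.

```python
import math

def sample_path(t):
    # One fixed sample function x_t(omega); it is continuous, so the
    # hypothesis lim_{s->t} x_s = x_t holds at every t (assumed example).
    return math.sin(2 * math.pi * t)

def upper_lower(n, t, path=sample_path, grid=200):
    # Locate the interval [j/n, (j+1)/n) that contains t.
    j = math.floor(t * n)
    lo, hi = j / n, (j + 1) / n
    # Approximate G.L.B. and L.U.B. on a finite grid inside the interval.
    # The point t itself is included so that L <= x_t <= U holds exactly.
    pts = [lo + (hi - lo) * k / grid for k in range(grid)] + [t]
    vals = [path(s) for s in pts]
    return min(vals), max(vals)

def gap(n, t):
    # Width U_t^(n) - L_t^(n) of the sandwich at the parameter value t.
    low, up = upper_lower(n, t)
    return up - low
```

For a sample function continuous at $t$, `gap(n, t)` shrinks as `n` grows, which is the pointwise convergence of the extreme terms that the proof combines with Fubini's theorem.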




The following theorem is general enough to cover all the specific 
stochastic processes discussed in this book. Its significance will be 
discussed below. 

THEOREM 2.6 Let $\{x_t, t \in T\}$ be a process with a Lebesgue measurable
parameter set $T$. Suppose that there is a $t$ set $T_1$ of Lebesgue measure 0
such that
(2.6) $$\operatorname*{p\,lim}_{s \to t} x_s = x_t, \qquad t \in T - T_1.$$
There is then a process $\{\tilde{x}_t, t \in T\}$, defined on the same $\omega$ space, which is
separable relative to the closed sets, measurable, and for which
$$P\{\tilde{x}_t(\omega) = x_t(\omega)\} = 1, \qquad t \in T.$$
(The $\tilde{x}_t$'s may take on the values $\pm\infty$.)


According to Theorem 2.4, it is no restriction to assume that the $x_t$
process is separable relative to the closed sets, and we shall do so. We
shall also assume that $T$ is a bounded set, and that $|x_t(\omega)| \le 1$ for all $t$, $\omega$,
since the general case can be reduced to this one by simple transformations.
If $I$ is an open interval, and if $\omega$ is fixed, let $A(I, \omega)$ be the closure of the
range of values of $x_t(\omega)$ for $t \in IT$, and define
$$A(t, \omega) = \bigcap_{I \ni t} A(I, \omega).$$
Let $\{t_j\}$ be a parameter sequence satisfying the conditions of the
separability (relative to the closed sets) condition. We observe that
$x_t(\omega) \in A(t, \omega)$, and that any process $\{\tilde{x}_t, t \in T\}$ satisfying the conditions
$$\tilde{x}_{t_j}(\omega) = x_{t_j}(\omega), \qquad j \ge 1,$$
$$\tilde{x}_t(\omega) \in A(t, \omega), \qquad \text{all } t, \omega,$$
is necessarily separable relative to the closed sets. For each positive 


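integer $n$ let $s_1^{(n)}, \cdots, s_n^{(n)}$ be the values $t_1, \cdots, t_n$ arranged in increasing
order, and define $s_0^{(n)} = -\infty$,
$$f_n(t, \omega) = x_{s_j^{(n)}}(\omega), \qquad s_{j-1}^{(n)} < t \le s_j^{(n)}, \quad j = 1, \cdots, n.$$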
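Then $f_n$ is $(t, \omega)$ measurable, and according to (2.6)
$$\operatorname*{p\,lim}_{n \to \infty} f_n(t, \cdot) = x_t, \qquad t \in T - T_1.$$
This limit equation implies that
$$\lim_{m, n \to \infty} E\{|f_m(t, \omega) - f_n(t, \omega)|\} = 0, \qquad t \in T - T_1,$$
and hence
$$\lim_{m, n \to \infty} \int_T E\{|f_m(t, \omega) - f_n(t, \omega)|\}\, dt = 0.$$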




It follows that the sequence $\{f_n\}$ converges in $(t, \omega)$ measure, so some
subsequence $\{f_{a_n}\}$ converges for almost all $(t, \omega)$, to a $(t, \omega)$ measurable
limit function $f$, defined only at the points of convergence of this
subsequence. By Fubini's theorem, there is a subset $T_0$ of $T$, of Lebesgue
measure 0, such that the sequence of $\omega$ functions $\{f_{a_n}(t, \cdot)\}$ converges with
probability 1 if $t \in T - T_0$. We can suppose that $T_0 \supset T_1 \cup \{t_j\}$. Then,
by (2.6),
$$P\{x_t(\omega) = f(t, \omega)\} = 1, \qquad t \in T - T_0.$$
Note that the function $f(t, \cdot)$ is not necessarily defined for all $\omega$. Now
define $\tilde{x}_t(\omega)$ as follows. If $f(t, \omega)$ is defined, and if $t \in T - T_0$, define
$\tilde{x}_t(\omega) = f(t, \omega)$. If $f(t, \omega)$ is not defined, or if $t \in T_0$, define $\tilde{x}_t(\omega) = x_t(\omega)$.
According to this definition,
$$P\{\tilde{x}_t(\omega) = x_t(\omega)\} = 1, \qquad t \in T.$$
Now $\tilde{x}_{t_j} = x_{t_j}$, since $t_j \in T_0$, and furthermore $\tilde{x}_t(\omega) \in A(t, \omega)$ for all $t$, $\omega$,
so that, according to a remark made at the beginning of the proof, the
process $\{\tilde{x}_t, t \in T\}$ is separable relative to the closed sets. Finally, this
process is measurable, because $\tilde{x}_t(\omega) = f(t, \omega)$ for almost all $t$, $\omega$.
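
The step-function approximations used in this proof can be sketched for a single sample function as follows. This is only an illustration under stated assumptions: the approximant takes the value of the sample function at the nearest point of the first $n$ sequence values lying at or to the right of $t$ (in line with the later remark that only limits from above are used), the fallback for $t$ beyond the largest point is an arbitrary choice, and `f_n`, `dyadic_sequence`, and the example path are hypothetical names.

```python
from bisect import bisect_left

def f_n(t, params, path):
    # params plays the role of t_1, ..., t_n; sort them into s_1 < ... < s_n.
    s = sorted(params)
    # Index of the first s_j with s_j >= t, i.e. s_{j-1} < t <= s_j.
    j = bisect_left(s, t)
    if j == len(s):
        # t lies to the right of every s_j: fall back to s_n
        # (an illustrative convention, not taken from the text).
        j = len(s) - 1
    return path(s[j])

def dyadic_sequence(levels):
    # Enumerate dyadic rationals of [0, 1]: 0, 1, 1/2, 1/4, 3/4, ...
    seq = [0.0, 1.0]
    for k in range(1, levels + 1):
        seq += [i / 2 ** k for i in range(1, 2 ** k, 2)]
    return seq
```

With a path that is continuous from the right at $t$, `f_n(t, seq[:n], path)` approaches `path(t)` as more of the sequence is used.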

The proof just given uses only the fact that (2.6) is true when $s \to t$
from above, and the hypotheses of the theorem can be correspondingly
weakened. However, the theorem is not really made stronger in this
way. In fact, if for each $t$ in a parameter set either
$$\operatorname*{p\,lim}_{s \downarrow t} x_s \quad \text{or} \quad \operatorname*{p\,lim}_{s \uparrow t} x_s$$
exists, then it is easily seen that (VII, Theorem 11.1) except for an at most
enumerable subset of this parameter set both exist and are $x_t$.

The following theorem is typical of the application of the concept of 
measurability of a stochastic process. The theorem justifies all integrals 
of sample functions used in this book. 

THEOREM 2.7 Let $\{x_t, t \in T\}$ be a measurable stochastic process. Then
almost all sample functions of the process are Lebesgue measurable functions
of $t$. If $E\{x_t(\omega)\}$ exists for $t \in T$, it defines a Lebesgue measurable function
of $t$. If $A$ is a Lebesgue measurable parameter set and if
$$\int_A E\{|x_t(\omega)|\}\, dt < \infty,$$
then almost all sample functions are Lebesgue integrable over $A$.

By hypothesis $x_t(\omega)$ is $(t, \omega)$ measurable. It follows (Fubini's theorem)
that $x_t(\omega)$ is Lebesgue measurable in $t$ for almost all $\omega$, that is, almost
all sample functions are Lebesgue measurable, and that $E\{x_t(\omega)\}$ defines
a measurable function of $t$, if this expectation exists. In particular,
$E\{|x_t(\omega)|\}$ defines a Lebesgue measurable function of $t$, not necessarily
finite-valued. The second hypothesis of the theorem is that the iterated
integral of $|x_t(\omega)|$, first in $\omega$ and then in $t \in A$, is finite. The iterated
integral in the reverse order is then finite, and the integral
$$\int_A |x_t(\omega)|\, dt$$
is therefore finite for almost all $\omega$, that is, almost all sample functions are
Lebesgue integrable over $A$, as was to be proved. Since the value of an
absolutely convergent iterated integral is independent of the order of
integration,
$$E\left\{\int_A x_t(\omega)\, dt\right\} = \int_A E\{x_t(\omega)\}\, dt.$$
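
The interchange of expectation and integration at the end of this proof can be checked on a toy example. The sketch below assumes a hypothetical process $x_t(\omega) = \omega t$ on a three-point probability space and approximates the $t$ integral over $A = [0, 1]$ by a midpoint rule; none of these choices come from the text.

```python
# Three-point probability space: (omega, probability) pairs (assumed example).
OMEGAS = [(-1.0, 0.2), (2.0, 0.5), (5.0, 0.3)]

def x(t, omega):
    # Hypothetical measurable process x_t(omega) = omega * t.
    return omega * t

def integrate(g, a, b, n=1000):
    # Midpoint rule on [a, b]; exact for functions linear in t.
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

def expect(h):
    # Expectation of the random variable omega -> h(omega).
    return sum(p * h(w) for w, p in OMEGAS)

# E{ integral of x_t dt } and integral of E{x_t} dt over A = [0, 1].
lhs = expect(lambda w: integrate(lambda t: x(t, w), 0.0, 1.0))
rhs = integrate(lambda t: expect(lambda w: x(t, w)), 0.0, 1.0)
```

Both orders of integration give the same number up to rounding, as the absolutely convergent iterated integral requires.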

Suppose that $f$ is a Lebesgue measurable and integrable function defined
on the finite interval $[a, b]$. It will be important for some purposes to
approximate the integral of $f$ by a Riemann sum
$$R[f; s_1, \cdots, s_n] = \sum_{j=1}^{n-1} f(s_j)(s_{j+1} - s_j), \qquad a = s_1 < \cdots < s_n = b.$$
It is clear that $R$ may not be a good approximation to the integral even if
$\delta = \operatorname{Max}_j (s_{j+1} - s_j)$ is small, since we have not supposed that $f$ is
Riemann integrable, but it will be shown that $R$ is a good approximation
for properly chosen $s_j$'s. If $0 < t < b - a$, and if we translate $s_1, \cdots, s_n$
to $s_1 + t, \cdots, s_n + t$ (where we suppose that $s_j + t \ne b$ and decrease
$s_j + t$ to $s_j + t - b + a$ whenever $s_j + t > b$), we obtain a new Riemann
sum $R_t[f; s_1, \cdots, s_n]$, and we shall show that for most values of $t$, in a
sense made precise below, $R_t$ is a good approximation to the integral of
$f$ if $\delta$ is small. To show this it is convenient to define


(2.7)
$$f(t) = f(t - b + a), \qquad b < t \le 2b - a,$$
$$R_t'[f; s_1, \cdots, s_n] = \sum_{j=1}^{n-1} f(s_j + t)(s_{j+1} - s_j).$$


Then it is trivial to verify that $R_t$ and $R_t'$ differ by at most three summands
in the sum defining $R_t$ and one in the sum defining $R_t'$, so that
(2.8) $$\lim_{\delta \to 0} \int_0^{b-a} |R_t - R_t'|\, dt = 0.$$
We now prove that
(2.9) $$\lim_{\delta \to 0} \int_0^{b-a} \left| R_t[f; s_1, \cdots, s_n] - \int_a^b f(s)\, ds \right| dt = 0.$$



This equation shows, among other things, that for small $\delta$ the measure of
the $t$ set for which the translated $s_j$'s do not give a Riemann sum which
is a good approximation to the Lebesgue integral is small. To prove
(2.9) let $\varepsilon$ be a positive number, and let $f_\varepsilon$ be a continuous function on
$[a, 2b - a]$, with
$$\int_a^{2b-a} |f(t) - f_\varepsilon(t)|\, dt < \varepsilon.$$
Then
$$\int_0^{b-a} \left| R_t'[f; s_1, \cdots, s_n] - \int_a^b f(s)\, ds \right| dt
 = \int_0^{b-a} \left| R_t'[f; s_1, \cdots, s_n] - \int_a^b f(s + t)\, ds \right| dt$$
$$\le \sum_{j=1}^{n-1} \int_{s_j}^{s_{j+1}} ds \int_0^{b-a} |f(s_j + t) - f(s + t)|\, dt$$
$$\le \sum_{j=1}^{n-1} \int_{s_j}^{s_{j+1}} ds \int_0^{b-a} |f_\varepsilon(s_j + t) - f_\varepsilon(s + t)|\, dt + 2\varepsilon(b - a)$$
$$\to 2\varepsilon(b - a), \qquad (\delta \to 0).$$
Since $\varepsilon$ can be taken arbitrarily near 0, (2.9) is true if $R_t$ is replaced by
$R_t'$, and therefore is true as written, in view of (2.8). The following
theorem applies this result to stochastic processes.
THEOREM 2.8 Let $\{x_t, a \le t \le b\}$ be a measurable stochastic process
($a$, $b$ finite). Then in the notation of the preceding paragraph
(2.10) $$\lim_{\delta \to 0} \int_0^{b-a} \left| R_t[x_s(\omega); s_1, \cdots, s_n] - \int_a^b x_s(\omega)\, ds \right| dt = 0$$
for all Lebesgue measurable and integrable sample functions. If almost all
sample functions are Lebesgue measurable and integrable, to every $\varepsilon > 0$
corresponds a choice of the $s_j$'s, say $s_j = t_j$, with
(2.11) $$\delta < \varepsilon, \qquad P\left\{ \left| R[x_s(\omega); t_1, \cdots, t_n] - \int_a^b x_s(\omega)\, ds \right| > \varepsilon \right\} < \varepsilon.$$
The first statement of the theorem is a trivial application of what we
have proved. To prove the second statement let $u(t, \omega)$ be the absolute
value in (2.10). Then (2.10) implies that, if $\varepsilon > 0$, the Lebesgue measure
of the $t$ set where $u > \varepsilon$ goes to 0 with $\delta$ for almost all $\omega$. It follows that
the $(t, \omega)$ measure of the $(t, \omega)$ set where $u(t, \omega) > \varepsilon$ goes to 0, that is,
$$\lim_{\delta \to 0} \int_0^{b-a} P\{u(t, \omega) > \varepsilon\}\, dt = 0.$$
Then, if $\delta$ is sufficiently small,
$$P\{u(t, \omega) > \varepsilon\} < \varepsilon$$
on a $t$ set of positive measure. Then this inequality is true for some
$s_j$'s with $\delta < \varepsilon$, and some $t$. The set of $t_j$'s of (2.11) can be taken as the
set of $s_j$'s translated through $t$.


In particular if
(2.12) $$\int_a^b E\{|x_t(\omega)|\}\, dt < \infty$$
it is easy to show that
(2.13) $$\lim_{\delta \to 0} E\left\{ \int_0^{b-a} \left| R_t[x_s(\omega); s_1, \cdots, s_n] - \int_a^b x_s(\omega)\, ds \right| dt \right\} = 0.$$
In fact we need only observe that we have just proved that the integrand
of this $(t, \omega)$ integral goes to 0 with $\delta$ in $(t, \omega)$ measure, and it is easy
to verify that the integrands are uniformly integrable, so that integration
to the limit is legitimate.
Finally we observe that, if we suppose that (2.12) is true, and also
assume
(2.14) $$\lim_{t \to s} E\{|x_t - x_s|\} = 0, \qquad a \le s \le b,$$
then the averaging integration in $t$ in (2.10) and (2.13) becomes unnecessary,
because it can be shown that with these hypotheses
$$\lim_{\delta \to 0} E\left\{ \left| R[x_s(\omega); s_1, \cdots, s_n] - \int_a^b x_s(\omega)\, ds \right| \right\} = 0.$$
The limit equations we have discussed make it possible to define
$\int_a^b x_t(\omega)\, dt$ as a random variable which is a limit in a suitable sense of
Riemann sums. However, the use of Riemann sums to define the integral
loses the useful literal interpretation of the integral as the ordinary
integral of the general sample function.
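
The averaged estimate (2.9) can be explored numerically. The sketch below assumes the hypothetical integrand $f(s) = s^{-1/2}$ on $[a, b] = [0, 1]$ (Lebesgue integrable with integral 2, but unbounded and so not Riemann integrable), a uniform partition, and a midpoint grid of translations $t$; it computes the translated sums $R_t'$ of (2.7) and their mean absolute error. All names are illustrative.

```python
import math

A, B = 0.0, 1.0

def f(s):
    # Integrable but unbounded near 0; the value at 0 is set to 0 by convention.
    return 1.0 / math.sqrt(s) if s > 0 else 0.0

def f_ext(s):
    # Extension (2.7): f(t) = f(t - b + a) for b < t <= 2b - a.
    return f(s - (B - A)) if s > B else f(s)

def translated_sum(t, s):
    # R'_t[f; s_1, ..., s_n] = sum_j f(s_j + t) (s_{j+1} - s_j).
    return sum(f_ext(s[j] + t) * (s[j + 1] - s[j]) for j in range(len(s) - 1))

def mean_abs_error(n_intervals, exact=2.0, m=1999):
    # Uniform partition a = s_1 < ... < s_n = b with delta = (b - a) / n_intervals.
    s = [A + (B - A) * j / n_intervals for j in range(n_intervals + 1)]
    # Midpoint grid of translations t in (0, b - a).
    ts = [(B - A) * (k + 0.5) / m for k in range(m)]
    # Since b - a = 1, this average approximates the t integral in (2.9).
    return sum(abs(translated_sum(t, s) - exact) for t in ts) / m
```

Comparing `mean_abs_error(10)` with `mean_abs_error(100)` shows the averaged error shrinking as $\delta$ decreases, even though individual Riemann sums (for instance one that samples a point very close to 0) can be poor.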

Let $\{x_t, t \in T\}$ be a stochastic process. In §1 we defined a field $\mathcal{F}_T'$
of measurable $\omega$ sets, the smallest Borel field of sets with respect to which
all the $x_t$'s are $\omega$ measurable and which contains every subset of any one
of its sets of probability 0. The finite dimensional distributions determine
the measures of $\mathcal{F}_T'$ sets, but do not determine the measure of any
measurable set not in $\mathcal{F}_T'$. Let $\{\tilde{x}_t, t \in T\}$ be a stochastic process with
$$P\{\tilde{x}_t(\omega) = x_t(\omega)\} = 1, \qquad t \in T,$$
and such that for each $t$ the $\omega$ set where $\tilde{x}_t(\omega) \ne x_t(\omega)$ is an $\mathcal{F}_T'$ set.
The $\tilde{x}_t$ process will be called a standard modification of the $x_t$ process in
the following discussion. We have not made a point of it in the statements
of the theorems, but a glance at their proofs shows that the $\tilde{x}_t$
processes in Theorems 2.4 and 2.6 are standard modifications of the $x_t$
processes.

For many purposes it is legitimate to replace a stochastic process by a 
standard modification. This change does not affect the finite dimensional 
distributions, but may decrease the field $\mathcal{F}_T'$.

If $\{x_t, t \in T\}$ is a stochastic process, with standard measurable
modifications $\{x_t^{(i)}, t \in T\}$, $i = 1, 2$, then
$$P\{x_t^{(1)}(\omega) = x_t^{(2)}(\omega)\} = 1, \qquad t \in T,$$
and it follows (Fubini's theorem) that, for almost all $\omega$,
$$x_t^{(1)}(\omega) = x_t^{(2)}(\omega)$$
for almost all $t$, that is, corresponding sample functions are equal for
almost all $t$, with probability 1. Thus we can define
$$\int_A x_t(\omega)\, dt$$
as the corresponding integral for a measurable standard modification of
the given process, obtaining a unique random variable, if values on $\omega$
sets of probability 0 are neglected. This will sometimes be convenient
below.

According to Theorem 2.3 every process has a standard modification
which is separable. (Examples of non-separable processes will be given
below.) Separability is thus not a restriction on the finite dimensional
distributions. On the other hand, if $\{x_t, t \in T\}$ is a process with Lebesgue
measurable parameter set, the process is necessarily measurable if $T$ has
Lebesgue measure 0, whereas, if $T$ has positive Lebesgue measure, the
measurability of the process is a restriction on the finite dimensional
distributions. The following is a trivial example of a process (with a
Lebesgue measurable parameter set) which is not measurable and which
has no standard modification which is measurable. We impose no
restriction on $\Omega$, but suppose that $T$ has positive Lebesgue measure.
There is then a bounded function $a(t)$ of $t \in T$, which is not Lebesgue
measurable. Suppose that the finite dimensional distributions are
determined by
$$P\{x_t(\omega) = a(t)\} = 1, \qquad t \in T.$$
With this definition, if $\{\tilde{x}_t, t \in T\}$ is any standard modification of the $x_t$
process, $E\{\tilde{x}_t(\omega)\} = E\{x_t(\omega)\} = a(t)$ and the $\tilde{x}_t$ process cannot be
measurable, or $E\{\tilde{x}_t(\omega)\}$ would be measurable by Theorem 2.7. Less trivial
examples of processes with no standard modifications which are measurable
processes will be given below.

In discussing the measurability of a process $\{x_t, t \in T\}$ with Lebesgue
measurable parameter set $T$ it is no restriction to assume that $T$ is the
infinite line $(-\infty, \infty)$. In fact, if $T$ is not the infinite line, define
$x_t(\omega) = 0$ for $t \notin T$ to obtain a new process with an enlarged parameter
range. The new process is measurable if and only if the old one is, and
there is a measurable standard modification of the new one if and only
if there is a measurable standard modification of the old one.

Suppose then that $\{x_t, -\infty < t < \infty\}$ is a stochastic process, and
define for each $t$


4 s 

x ™ (e, w) = ZeysjnlO), J on <t< sa 
J=0 Ele 
SOMO Ties 


It can be shown that the x, process has a measurable standard modification 
if and only if there is a value of c for which 
Pflim x,™(c, w) = X4,,<(@)} 
n>a 

for each ¢ not in some set of Lebesgue measure 0. (See the Appendix 
for references to equivalent results.) Moreover, it can be shown that 
there is a separable (relative to the closed sets) measurable standard 
modification whenever there is a measurable standard modification. Note 
that the stated necessary and sufficient condition is a condition on the
measures of $\mathcal{F}_T$ sets, that is, on the finite dimensional distributions.
This condition implies, for example, that if the random variables of the 
process are mutually independent and have a common distribution (not 
concentrated at a single point) the process has no measurable standard 
modification. 

Now let $\{\tilde{x}_t, t \in T\}$ be a stochastic process in which $\tilde{\omega}$ space $\tilde{\Omega}$ is function
space and $\tilde{x}_t$ is the $t$th coordinate function, as discussed in §1. We shall
call such a process one of function space type. Processes of this type are
the ones most commonly considered in the literature, since, as described
in I §5, they can be defined simply by prescribing a mutually consistent
collection of finite dimensional distributions. These processes have the
simple property that the general basic point $\tilde{\omega}$ coincides with the general
sample function of the process. In discussing stochastic processes it is
not possible to suppose that all processes encountered are of function
space type. This fact is illustrated in the following example. Suppose
that one wishes to consider besides the basic random variables of the
process, the $\tilde{x}_t$'s, some functions of the $\tilde{x}_t$'s, say the squares. In other
words, one wishes to consider the process $\{\tilde{x}_t^2, t \in T\}$. This process no
longer has the property that $\tilde{\omega}$ is the sample function of the process, it is
no longer of function space type, and thus even in this trivial case it is
necessary to use other processes.

In spite of the preceding example of the impossibility of considering 
only processes of function space type, it is enlightening to see how the 
concepts of separability and measurability can be handled in this case. 
The following discussion, which is given without proofs because we shall 
not use this point of view, shows how these concepts can be treated 
without going to other types of processes. Standard modifications 
cannot be used for this purpose, because a standard modification of a 
process of function space type need not be of function space type. 

In our discussion of processes of function space type we shall use a
notation which exhibits the range of values of the functions and the class of
measurable sets. The process $\{\tilde{x}_t, t \in T, X, \mathcal{F}\}$ is the process of function
space type in which $\tilde{\Omega}$ is the space of all functions with domain $T$ and
range $X$, and $\mathcal{F}$ is the class of measurable sets. In agreement with
previous notation, $\mathcal{F}_T$ will denote the smallest Borel field of $\tilde{\omega}$ sets with
respect to which the $\tilde{x}_t$'s are measurable, and $\mathcal{F}_T'$ will denote the smallest
Borel field of $\tilde{\omega}$ sets such that $\mathcal{F}_T' \supset \mathcal{F}_T$ and that $\mathcal{F}_T'$ contains all subsets
of its sets of probability 0. It will sometimes be necessary to suppose
that $X$ is a closed point set of the infinite line (or plane in the complex
case), and in speaking of such a closed set we shall consider that the
finite line has been closed at both ends, by the two points $-\infty$, $+\infty$,
and that the plane is the direct product of the coordinate axes, closed in
this way. The problem we discuss is how to enlarge the Borel field $\mathcal{F}_T$
to obtain a class $\mathcal{F}_1$ of measurable $\tilde{\omega}$ sets which will make the process
separable and measurable. It has been shown that, if $X$ is the finite line
(and the proof is even applicable if $X$ contains at least two points), if $c$
is any real number, and if $I$ is any non-denumerable parameter set, then
the $\tilde{\omega}$ set
$$\{\tilde{x}_t(\tilde{\omega}) \le c,\ t \in I\}$$
has probability 0 if it is in the class $\mathcal{F}_T'$. It follows that, if the process
$\{\tilde{x}_t, a \le t \le b, X, \mathcal{F}_T'\}$ (with $X$ containing at least two points) is
separable, if $\{t_j\}$ is a sequence of parameter values satisfying the conditions
of the separability definition, and if $I$ is an open interval of parameter
values, then
$$P\{\operatorname*{L.U.B.}_{t \in I} \tilde{x}_t(\tilde{\omega}) = \infty\} = P\{\operatorname*{L.U.B.}_{t_j \in I} \tilde{x}_{t_j}(\tilde{\omega}) = \infty\} = 1.$$
This may happen, and in fact the process under consideration will be
separable for certain assignments of finite dimensional probability
distributions, but this does not happen in the most interesting cases. It has
been shown, if $X$ is the finite line (and again the proof is applicable whenever
$X$ contains at least two points), that the process $\{\tilde{x}_t, a \le t \le b, X, \mathcal{F}_T'\}$
is not measurable for any assignment of finite dimensional probability
distributions. These facts make clear the advisability of enlarging the
field of measurable sets. The method of enlargement is described in the
next paragraph.

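Let $\tilde{\Gamma}$ be an $\tilde{\omega}$ set of outer measure 1 relative to the given field $\mathcal{F}_T'$ of
measurable sets, that is, we suppose that the relation $\tilde{\Gamma} \subset M \in \mathcal{F}_T'$
implies that $P\{M\} = 1$. Define the Borel field $\mathcal{F}_1$ as the class of all $\tilde{\omega}$
sets $\Lambda$ of the form
$$\Lambda = \Lambda_1 \tilde{\Gamma} \cup \Lambda_2 (\tilde{\Omega} - \tilde{\Gamma}), \qquad \Lambda_1, \Lambda_2 \in \mathcal{F}_T',$$
and for each such $\Lambda$ define
$$P_1\{\Lambda\} = P\{\Lambda_1\}.$$
It is easily shown that $P_1\{\Lambda\}$ is uniquely defined in this way, and is a
probability measure, that $\mathcal{F}_T' \subset \mathcal{F}_1$, and that
$$P_1\{\Lambda\} = P\{\Lambda\}, \qquad \Lambda \in \mathcal{F}_T'.$$
The stochastic process $\{\tilde{x}_t, t \in T\}$ with $\mathcal{F}_T'$ replaced by $\mathcal{F}_1$ and $P\{\cdot\}$ by
$P_1\{\cdot\}$ will be called a standard extension of the given one. The standard
extension depends on the choice of $\tilde{\Gamma}$. Note that a standard extension
of the $\tilde{x}_t$ process still is of function space type. The only difference
between it and the given process is that more classes of $t$ functions $\tilde{\omega}$
have been assigned probabilities. In this approach the counterpart of
Theorem 2.4 is: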

THEOREM 2.4' Let $\{\tilde{x}_t, t \in T, X, \mathcal{F}_T'\}$ be a process of function space
type, with $X$ a closed set of the infinite line (or plane in the complex case).
There is then a standard extension of the process which is separable relative
to the closed sets.

We omit the proof of this theorem. (It is essentially the same theorem
as Theorem 2.4.)
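
A finite toy version of the standard extension may help fix ideas. The sketch below is an illustration only: it assumes a four-point space with the field generated by a two-block partition, a hypothetical set `GAMMA` of outer measure 1, and the reading of the construction in which the enlarged field consists of the sets $\Lambda_1 \Gamma \cup \Lambda_2 (\Omega - \Gamma)$ with new probability $P\{\Lambda_1\}$.

```python
from fractions import Fraction
from itertools import product

OMEGA = frozenset("abcd")
# Field generated by the partition {a, b} | {c, d}, with P = 1/2 on each block.
FIELD = {
    frozenset(): Fraction(0),
    frozenset("ab"): Fraction(1, 2),
    frozenset("cd"): Fraction(1, 2),
    OMEGA: Fraction(1),
}
GAMMA = frozenset("ac")  # hypothetical choice of the outer-measure-1 set

def has_outer_measure_one(gamma):
    # Every measurable set containing gamma must have probability 1.
    return all(p == 1 for m, p in FIELD.items() if gamma <= m)

def standard_extension(gamma):
    # Enlarged field: sets (L1 & gamma) | (L2 - gamma), with P1 = P{L1}.
    p1 = {}
    for l1, l2 in product(FIELD, repeat=2):
        lam = (l1 & gamma) | (l2 - gamma)
        # P1 must not depend on how lam is represented.
        assert p1.setdefault(lam, FIELD[l1]) == FIELD[l1]
    return p1

P1 = standard_extension(GAMMA)
```

In this toy case the enlarged field contains every subset of the space, the new measure agrees with the old one on the original field, and it concentrates all probability on `GAMMA`.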




The problem of measurability is solved in an equally satisfactory 
manner. In fact the following theorem can be proved. 

THEOREM 2.9 Let $\{\tilde{x}_t, t \in T, X, \mathcal{F}_T'\}$ be a process of function space type,
with $X$ a closed set of the infinite line (or plane in the complex case), and
$T$ a Lebesgue measurable set. There is a standard extension of the process
which is separable and measurable if and only if there is a measurable
standard modification.

The proof will be omitted. 

We conclude the discussion of processes of function space type by
applying the results to a rather trivial example already considered in this
and the previous section. Let $\{\tilde{x}_t, -\infty < t < \infty, X, \mathcal{F}_T'\}$ be a process
of function space type, with $\mathcal{F}_T'$ as defined in the above discussion.
Suppose that
$$P\{\tilde{x}_t(\tilde{\omega}) = 0\} = 1, \qquad -\infty < t < \infty.$$
If the range $X$ of the function values is the single point 0, the function
space $\tilde{\Omega}$ contains only a single function, the identically vanishing one.
In this case
$$P\{\tilde{x}_t(\tilde{\omega}) = 0,\ -\infty < t < \infty\} = 1$$
and the $\tilde{x}_t$ process is separable and measurable. If $X$ contains a second
point, the $\tilde{x}_t$ process is neither separable nor measurable, and the
probability
$$P\{\tilde{x}_t(\tilde{\omega}) = 0,\ -\infty < t < \infty\}$$
is not defined. This probability will be defined, and will be 1, for every
separable standard extension of the process. It is easy to see that the $\tilde{\omega}$
set $\tilde{\Gamma}$ in terms of which the standard extension is defined can be taken
as the set containing only the identically vanishing function. This
extension makes the process separable. Every separable standard
extension is measurable, by Theorem 2.5. On the other hand, a second
standard extension can be defined in which the set $\tilde{\Gamma}$ just used is replaced
by its complement. With this definition
$$P\{\tilde{x}_t(\tilde{\omega}) = 0,\ -\infty < t < \infty\} = 0,$$
but this standard extension is neither separable nor measurable. This
example shows the rather arbitrary nature of the standard extensions, and
the necessity of a new criterion, such as separability, to aid in determining
the choice of the extension. The situation in this respect is the same as
for standard modifications of processes. A new criterion like separability
must be used to determine the choice of standard modification also.

Let $\Omega$ be a space on sets of which a probability measure $P$ is defined.
Let $x$ be a random variable with a distribution function $F$. All we shall
require of $F$ is that it does not define a distribution confined to a single
value. Define $y(\omega) = 0$. Then we can think of the random variable $y$
as a function of the random variable $x$. This is, of course, trivial. It is
not true in all cases, however, if $y$ is the identically vanishing random
variable on some $\Omega$, that we can write $y(\omega) = f[x(\omega)]$, where $x$ has the
distribution function $F$, for the simple reason that there may be no such
random variable $x$ defined on $\Omega$. For example, if $\Omega$ consists of exactly
one point (to which considered as a point set the probability 1 is assigned),
it is clear that the distribution of every random variable is confined to a
single value. In order to overcome the difficulties that are encountered
in situations like this, in which $\Omega$ and the probability measure $P\{\Lambda\}$ are
of too simple a structure for the problems considered, we define extension
by adjunction, as follows. Let $\Omega^{(i)}$ be a space on sets of which a
probability measure $P^{(i)}$ is defined, $i = 1, 2$. Define $\Omega^{(1,2)}$ as the space of pairs
$\omega^{(1,2)}$: $(\omega^{(1)}, \omega^{(2)})$, $\omega^{(i)} \in \Omega^{(i)}$, and define a probability measure $P^{(1,2)}$ on $\Omega^{(1,2)}$
in the usual way as the product of the $\Omega^{(1)}$ and $\Omega^{(2)}$ measures considered
as independent, so that, if $\Lambda^{(1)}$ is $\omega^{(1)}$ measurable, and $\Lambda^{(2)}$ is $\omega^{(2)}$
measurable,
$$P^{(1,2)}\{\Lambda\} = P^{(1)}\{\Lambda^{(1)}\}\, P^{(2)}\{\Lambda^{(2)}\}, \qquad \Lambda = \{\omega^{(1)} \in \Lambda^{(1)},\ \omega^{(2)} \in \Lambda^{(2)}\}.$$
Every $\omega^{(1)} \in \Omega^{(1)}$ then corresponds to an $\omega^{(1,2)}$ set, the set of all points
$(\omega^{(1)}, \omega^{(2)})$ with the given first coordinate. This correspondence makes a
measurable $\omega^{(1)}$ set go into a measurable $\omega^{(1,2)}$ set of the same probability.
A random variable $x$ on $\Omega^{(1)}$ can also be considered a random variable
on $\Omega^{(1,2)}$, and will have the same distribution function. If a stochastic
process is defined using the space $\Omega^{(1)}$, we obtain in this way a stochastic
process with the same finite dimensional distributions, separability and
measurability properties, and so on, defined on the space $\Omega^{(1,2)}$. The
new process will be said to be obtained by adjoining $\Omega^{(2)}$ to $\Omega^{(1)}$. This
adjunction procedure is, of course, the procedure used in discussing
independent trials, and results in a space and probability measure with a
finer structure than the initial one. For example, if $\Omega^{(2)}$ is the unit
interval, and if $\Omega^{(2)}$ probability measure is Lebesgue measure, there are
$\Omega^{(2)}$ random variables with all distribution functions, and therefore $\Omega^{(1,2)}$
random variables with all distribution functions.
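
Extension by adjunction can be sketched on finite spaces. In the illustration below the one-point space, the two-point space, and all names are assumptions; the sketch only shows that a variable lifted to the product space keeps its distribution, while the product space also carries a variable whose distribution is not confined to a single value.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

P1 = {"w": Fraction(1)}                      # Omega^(1): a single point
P2 = {0: Fraction(1, 3), 1: Fraction(2, 3)}  # Omega^(2): hypothetical two-point space

def adjoin(pa, pb):
    # Product measure on pairs (omega1, omega2), the factors being independent.
    return {(a, b): pa[a] * pb[b] for a, b in product(pa, pb)}

def distribution(rv, measure):
    # Distribution of a random variable on a finite probability space.
    dist = defaultdict(Fraction)
    for point, prob in measure.items():
        dist[rv(point)] += prob
    return dict(dist)

P12 = adjoin(P1, P2)

def y(omega1):
    # The identically vanishing random variable on Omega^(1).
    return 0

def y_lifted(pair):
    # The same variable regarded as a random variable on the product space.
    return y(pair[0])

def x_new(pair):
    # A non-degenerate variable that exists only after the adjunction.
    return pair[1]
```

Here `distribution(y, P1)` and `distribution(y_lifted, P12)` coincide, and `y_lifted` can be written as a function of `x_new`, which was impossible on the one-point space.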


3. Gaussian processes: strict and wide sense concepts

A stochastic process $\{x_t, t \in T\}$ is called Gaussian if the joint distribution
of every finite set of the $x_t$'s is Gaussian. These processes are of great
importance theoretically as well as in applied work, and hence deserve
separate consideration. The following theorem shows how the most
general Gaussian process is determined.




THEOREM 3.1 Let $T$ be any finite or infinite aggregate. Let $\mu(\cdot)$ be any
function of $t \in T$, and let $r(\cdot, \cdot)$ be a function of $s$ and $t$ ($s, t \in T$), satisfying
the conditions

(a) $r(s, t) = \overline{r(t, s)}$,

(b) if $t_1, \cdots, t_N$ is any finite $T$ set, the matrix $[r(t_m, t_n)]$ is non-negative
definite.

There is then a Gaussian stochastic process $\{x_t, t \in T\}$ for which
(3.1)
$$E\{x_t\} = \mu(t),$$
$$E\{x_s \bar{x}_t\} - \mu(s)\overline{\mu(t)} = r(s, t).$$
If the functions $\mu(\cdot)$ and $r(\cdot, \cdot)$ are real, there is a real Gaussian process
$\{x_t, t \in T\}$ satisfying (3.1). In any case there is a complex Gaussian process
$\{x_t, t \in T\}$ satisfying (3.1) and also
(3.2) $$E\{x_s x_t\} = \mu(s)\mu(t).$$


Suppose first that $\mu(t)$ is real and that $r(s, t)$ is real and satisfies (a)
and (b) of the theorem. If $t_1, \cdots, t_N$ is any finite $t$ set there is then an
$N$ variate Gaussian distribution with mean values $\mu(t_1), \cdots, \mu(t_N)$ and
covariance matrix $[r(t_m, t_n)]$. This is the distribution function with
characteristic function
(3.3) $$\exp\left\{ -\tfrac{1}{2} \sum_{m,n=1}^{N} r(t_m, t_n) \lambda_m \lambda_n + i \sum_{m=1}^{N} \lambda_m \mu(t_m) \right\}.$$
If the matrix $[r(t_m, t_n)]$ is non-singular, this distribution has density
(3.4) $$\frac{|a_{mn}|^{1/2}}{(2\pi)^{N/2}} \exp\left\{ -\tfrac{1}{2} \sum_{m,n=1}^{N} a_{mn} [\xi_m - \mu(t_m)][\xi_n - \mu(t_n)] \right\},$$
where the matrix $[a_{mn}]$, with determinant $|a_{mn}|$, is the inverse of the
matrix $[r(t_m, t_n)]$. Moreover, if $M < N$ the distribution defined by the
characteristic function (3.3) assigns to $x_{t_1}, \cdots, x_{t_M}$ a Gaussian distribution
with means $\mu(t_1), \cdots, \mu(t_M)$ and the covariance matrix $[r(t_m, t_n)]$ with
$m, n \le M$. Thus if $M < N$ the marginal distribution of $x_{t_1}, \cdots, x_{t_M}$ is
the same as that assigned to $x_{t_1}, \cdots, x_{t_M}$. Consequently, the consistency
conditions of Kolmogorov (I §5) are satisfied, and there is a real Gaussian
process satisfying (3.1), with $\omega$ space a certain coordinate space. The
distributions involved are determined uniquely by the assigned means and
covariances.
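
The finite dimensional distributions in the real case can be sketched numerically. The example below assumes a hypothetical mean function $\mu(t) = 0$ and covariance $r(s, t) = \min(s, t)$ on positive parameter values (so that the covariance matrices are non-singular), builds the covariance matrix for a finite parameter set, and samples from the corresponding Gaussian distribution through a hand-written Cholesky factor. The function names are illustrative.

```python
import math
import random

def mu(t):
    # Hypothetical mean function.
    return 0.0

def r(s, t):
    # Hypothetical real covariance function; non-negative definite.
    return min(s, t)

def cov_matrix(ts):
    # The matrix [r(t_m, t_n)] for the finite parameter set ts.
    return [[r(s, t) for t in ts] for s in ts]

def cholesky(a):
    # Lower triangular factor with low * low^T = a (a must be non-singular).
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low

def sample(ts, rng):
    # One draw of (x_{t_1}, ..., x_{t_N}) with means mu and covariance r.
    low = cholesky(cov_matrix(ts))
    z = [rng.gauss(0.0, 1.0) for _ in ts]
    return [mu(t) + sum(low[i][k] * z[k] for k in range(i + 1))
            for i, t in enumerate(ts)]
```

The consistency condition appears here as the fact that the covariance matrix for the first $M$ parameter values is the leading $M \times M$ block of the matrix for all $N$ values.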

We complete the proof by dropping the hypothesis that the process is 
necessarily real, and defining a Gaussian process which satisfies (3.1) and 
(3.2). The process is obviously uniquely determined by these conditions, 


§3 GAUSSIAN PROCESSES 73 


or, more exactly, the finite dimensional distributions involved are uniquely 
determined by these conditions. Note that according to these conditions 


E((z,— nF} = 0. 
The process. will therefore only be real in trivial cases, in fact only when 
u(t) is real and r(s, t) = 0, in which case 
Pf{x(o) = u(t} = 1, teT. 

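We shall define real random variables $\xi_t$, $\eta_t$ to satisfy†

$$\begin{aligned}
E\{\xi_t\} &= \Re\,\mu(t), \qquad E\{\eta_t\} = \Im\,\mu(t),\\
E\{\xi_s\xi_t\} - E\{\xi_s\}E\{\xi_t\} &= E\{\eta_s\eta_t\} - E\{\eta_s\}E\{\eta_t\} = \tfrac12\,\Re\{r(s, t)\},\\
E\{\xi_s\eta_t\} - E\{\xi_s\}E\{\eta_t\} &= -\tfrac12\,\Im\{r(s, t)\}.
\end{aligned} \tag{3.5}$$

The relations (3.5) imply (3.1) and (3.2) if $x_t = \xi_t + i\eta_t$. To show that
the $\xi_t$, $\eta_t$ process really exists, we apply the criterion for real processes
already proved. That is, we show that the covariance matrix of any
finite number of $\xi_t$'s and $\eta_t$'s, as defined by (3.5), is symmetric and
non-negative definite. It is no restriction to take corresponding pairs of $\xi_t$'s
and $\eta_t$'s, that is, we take

$$\xi_{t_1}, \dots, \xi_{t_N}, \ \eta_{t_1}, \dots, \eta_{t_N}$$

and investigate the corresponding $2N$-dimensional covariance matrix
$[\rho_{mn}]$ defined by (3.5). The matrix is obviously symmetric. Moreover,
if $\lambda_1, \dots, \lambda_{2N}$ are real,

$$\begin{aligned}
\sum_{m,n=1}^{2N} \rho_{mn}\lambda_m\lambda_n
&= \tfrac12 \sum_{m,n=1}^{N} \Re\{r(t_m, t_n)\}(\lambda_m\lambda_n + \lambda_{N+m}\lambda_{N+n})
 - \tfrac12 \sum_{m,n=1}^{N} \Im\{r(t_m, t_n)\}(\lambda_m\lambda_{N+n} - \lambda_{N+m}\lambda_n)\\
&= \tfrac12 \sum_{m,n=1}^{N} r(t_m, t_n)(\lambda_m\lambda_n + \lambda_{N+m}\lambda_{N+n})
 + \tfrac{i}{2} \sum_{m,n=1}^{N} r(t_m, t_n)(\lambda_m\lambda_{N+n} - \lambda_{N+m}\lambda_n)\\
&= \tfrac12 \sum_{m,n=1}^{N} r(t_m, t_n)(\lambda_m - i\lambda_{N+m})(\lambda_n + i\lambda_{N+n})\\
&\ge 0
\end{aligned} \tag{3.6}$$

since the matrix $[r(t_m, t_n)]$ is non-negative definite. Hence the matrix
$[\rho_{mn}]$ is non-negative definite, as was to be proved.

† Throughout this book, $\Re$ and $\Im$ will be used to denote "real part" and "imaginary
part," respectively.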
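
The construction in (3.5) and the identity (3.6) can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes a Hermitian non-negative definite matrix $r$ built as $BB^*$ from an arbitrary complex matrix $B$, and the helper names (`gram`, `real_covariance`, `quad_form`, `hermitian_form`) are hypothetical.

```python
import random


def gram(b):
    """Return r = B B^*, which is Hermitian and non-negative definite."""
    n = len(b)
    return [[sum(b[m][j] * b[k][j].conjugate() for j in range(len(b[0])))
             for k in range(n)] for m in range(n)]


def real_covariance(r):
    """Covariance matrix of (xi_1..xi_N, eta_1..eta_N) prescribed by (3.5)."""
    n = len(r)
    rho = [[0.0] * (2 * n) for _ in range(2 * n)]
    for m in range(n):
        for k in range(n):
            half_re, half_im = r[m][k].real / 2, r[m][k].imag / 2
            rho[m][k] = half_re            # cov(xi_m, xi_k)
            rho[n + m][n + k] = half_re    # cov(eta_m, eta_k)
            rho[m][n + k] = -half_im       # cov(xi_m, eta_k)
            rho[n + m][k] = half_im        # cov(eta_m, xi_k)
    return rho


def quad_form(rho, lam):
    """Left side of (3.6): the real quadratic form in lambda_1..lambda_2N."""
    size = len(lam)
    return sum(rho[i][j] * lam[i] * lam[j]
               for i in range(size) for j in range(size))


def hermitian_form(r, lam):
    """Last line of (3.6), with z_m = lambda_m - i * lambda_{N+m}."""
    n = len(r)
    z = [complex(lam[m], -lam[n + m]) for m in range(n)]
    total = sum(r[m][k] * z[m] * z[k].conjugate()
                for m in range(n) for k in range(n))
    return 0.5 * total.real


rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
b = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(3)]
     for _ in range(3)]
r = gram(b)
rho = real_covariance(r)
lam = [rng.uniform(-1, 1) for _ in range(6)]
print(quad_form(rho, lam), hermitian_form(r, lam))
```

The two printed numbers should agree up to rounding and be non-negative, which is the content of (3.6) for this particular choice of $r$ and $\lambda$.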

This theorem can be used to obtain Gaussian processes intimately
related to given processes, as far as first and second moments are
concerned. For example, according to the following theorem, if a $y_t$ process
is given, a corresponding $x_t$ Gaussian process exists with the same means
and covariances. This means that in a discussion involving only means
and covariances of the $y_t$'s, it is no restriction to assume that the $y_t$
process is Gaussian. If the $y_t$ process is real, the corresponding $x_t$
process can also be supposed real. This means that, whenever two $y_t$'s
are uncorrelated, the corresponding $x_t$'s are independent. Whether the
$y_t$'s are real or not, the $x_t$ process can be chosen to satisfy (3.2), and
uncorrelated $y_t$'s will then correspond to independent $x_t$'s. If the $y_t$
process is not real, the $x_t$ process is not real either, and the assignment of
the $x_t$ means and covariances does not determine the $x_t$ distributions. One
token of this fact is that it is possible to superimpose the arbitrary
condition (3.2).

THEOREM 3.2 Let $\{y_t, t \in T\}$ be any stochastic process, with

$$E\{|y_t|^2\} < \infty, \qquad t \in T.$$

The parameter set T may be any finite or infinite aggregate. There is then
a corresponding Gaussian process, with the same range of the parameter,
but defined on a different measure space, whose random variables $\{x_t\}$
satisfy the equations

$$E\{x_t\} = 0, \qquad E\{x_s \bar{x}_t\} = E\{y_s \bar{y}_t\}, \qquad s, t \in T. \tag{3.7}$$

If the $y_t$ process is real, there is a real $x_t$ process satisfying (3.7). In any
case there is a complex Gaussian process satisfying (3.7) and the further
condition

$$E\{x_s x_t\} = 0. \tag{3.8}$$

If the $y_t$ and $x_t$ processes are both real, and satisfy (3.7) or if both (3.7)
and (3.8) are satisfied, the orthogonality of two $y_t$'s† implies independence
of the corresponding $x_t$'s.

If $y_t$ is replaced by $y_t - E\{y_t\}$ in the above, zero correlation of two $y_t$'s
implies independence of the corresponding $x_t$'s.

More generally, if $v_1$, $v_2$ are any two random variables in the closed linear
manifold† of the $y_t$'s and if $u_1$, $u_2$ are the corresponding random variables
in the closed linear manifold of the $x_t$'s, then (3.7) (for all $y_s$, $y_t$) implies

$$E\{u_1\} = 0, \qquad E\{u_1 \bar{u}_2\} = E\{v_1 \bar{v}_2\} \tag{3.7'}$$

and (3.8) (for all $x_s$, $x_t$) implies

$$E\{u_1 u_2\} = 0. \tag{3.8'}$$

† The random variables $u$, $v$ are said to be orthogonal if $E\{u\bar{v}\} = 0$.


Applying this theorem to the real and imaginary parts of the random 
variables of a given complex process, it is clear that there is a complex 
Gaussian process whose real and imaginary parts have the same covariance 
relations as the real and imaginary parts of the given process. Then 
(3.7) is certainly satisfied, and also 


$$E\{x_s x_t\} = E\{y_s y_t\}. \tag{3.9}$$


The exact covariance correspondence afforded by (3.9) in conjunction 
with (3.7) is not useful, however, and destroys the correspondence between 
orthogonality and independence. 

The proof of Theorem 3.2 is accomplished simply by noting that, if
$r(s, t) = E\{y_s \bar{y}_t\}$, then $r(s, t) = \overline{r(t, s)}$, and, if $z_1, \dots, z_N$ are any complex
numbers,

$$\sum_{m,n=1}^{N} r(t_m, t_n) z_m \bar{z}_n = \sum_{m,n=1}^{N} E\{y_{t_m} \bar{y}_{t_n}\} z_m \bar{z}_n = E\Big\{\Big|\sum_{m=1}^{N} z_m y_{t_m}\Big|^2\Big\} \ge 0,$$

so that the hypotheses of Theorem 3.1 are certainly satisfied, if we take
$\mu(t) = 0$. The extension of the previous theorem involved in (3.7') and
(3.8') is trivial, since the latter relations are obvious when $v_1$ and $v_2$ are
linear combinations of the $y_t$'s; the general case is proved by going to
the limit. The theorem states in abstract language that there is a linear
transformation taking the closed linear manifold of the $y_t$'s into that of
the $x_t$'s, taking $y_t$ into $x_t$ for each $t$, which leaves the inner product
$E\{v_1 \bar{v}_2\}$ invariant. (The transformation is unitary.)
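
For a real process with finitely many parameter values, the correspondence of Theorem 3.2 can be made concrete. The sketch below is only an illustration under stated assumptions: the $y_t$ process lives on a hypothetical four-point sample space with equal probabilities, the second moments $r(s,t) = E\{y_s y_t\}$ are computed exactly, and a zero-mean Gaussian vector with the same second moments is produced from a Cholesky-type factor. The function names are hypothetical.

```python
import math
import random


def second_moments(ys, probs):
    """r(s, t) = E{y_s y_t} for real variables on a finite sample space."""
    n = len(ys)
    return [[sum(p * ys[s][w] * ys[t][w] for w, p in enumerate(probs))
             for t in range(n)] for s in range(n)]


def cholesky_psd(r, tol=1e-12):
    """Lower-triangular L with L L^T = r; zero pivots are allowed."""
    n = len(r)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = r[j][j] - sum(low[j][k] ** 2 for k in range(j))
        low[j][j] = math.sqrt(d) if d > tol else 0.0
        for i in range(j + 1, n):
            s = r[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = s / low[j][j] if low[j][j] > 0 else 0.0
    return low


def sample_gaussian(low, rng):
    """One draw of x = L g, where g has independent standard normal entries."""
    g = [rng.gauss(0.0, 1.0) for _ in low]
    return [sum(low[i][k] * g[k] for k in range(len(g)))
            for i in range(len(low))]


# Hypothetical y process: y_1 and y_2 are orthogonal but not independent
# (their product vanishes identically), and y_3 = y_1 + y_2.
probs = [0.25] * 4
ys = [[1, -1, 0, 0], [0, 0, 1, -1], [1, -1, 1, -1]]

r = second_moments(ys, probs)
low = cholesky_psd(r)
rebuilt = [[sum(low[i][k] * low[j][k] for k in range(3)) for j in range(3)]
           for i in range(3)]
x = sample_gaussian(low, random.Random(1))
```

Here `low[1][0]` is zero, so the Gaussian variables $x_1$ and $x_2$ are built from different independent normal variables: orthogonality of $y_1$, $y_2$ has become independence of $x_1$, $x_2$, as the theorem asserts.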

One of the reasons why Gaussian processes are important is the
simplification the Gaussian hypothesis brings to the theory of least
squares approximation. Let the random variables $x$, $y$ have a bivariate
Gaussian distribution, with zero expectations. We suppose that $x$ and $y$
are either real or, if not, satisfy the relations $E\{xy\} = 0$. Then the
difference

$$y - ax, \qquad a = \frac{E\{y\bar{x}\}}{E\{|x|^2\}}, \tag{3.10}$$

has zero expectation and is orthogonal to and uncorrelated with and
therefore independent of $x$. It follows that the conditional distribution
of $y$ for given $x$ is Gaussian, with expectation $ax$ and variance

$$E\{|y|^2\}\left[1 - \frac{|E\{y\bar{x}\}|^2}{E\{|x|^2\}E\{|y|^2\}}\right].$$

The bracket is between 0 and 1 by Schwarz's inequality. More generally,
if the random variables $x_1, \dots, x_n, y$ have a multivariate Gaussian
distribution with zero expectations and if the variables are either real or,
if not, satisfy the relations

$$E\{x_j x_k\} = E\{x_j y\} = 0, \qquad j, k = 1, \dots, n,$$

the difference

$$y - \sum_{j=1}^{n} a_j x_j, \tag{3.10'}$$

† The closed linear manifold determined by a set of random variables is the collection
of random variables which are finite linear combinations of the given random variables
or limits in the mean of sequences of such linear combinations.


G1) Elly a = Elly fi 


has zero expectation and is uncorrelated with and therefore independent
of $x_1, \dots, x_n$ for properly chosen $a_1, \dots, a_n$ (see IV §3). It follows that
the conditional distribution of $y$ for given $x_1, \dots, x_n$ is Gaussian, with

$$E\{y \mid x_1, \dots, x_n\} = \sum_{j=1}^{n} a_j x_j. \tag{3.12}$$


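The difference in (3.10') is independent of and therefore orthogonal to
every function $f$ measurable on the sample space of $x_1, \dots, x_n$ (in
particular to every Baire function of these variables) whose square is
integrable. Hence

$$\begin{aligned}
E\{|y - f|^2\} &= E\Big\{\Big|\Big(y - \sum_{j=1}^{n} a_j x_j\Big) + \Big(\sum_{j=1}^{n} a_j x_j - f\Big)\Big|^2\Big\}\\
&= E\Big\{\Big|y - \sum_{j=1}^{n} a_j x_j\Big|^2\Big\} + E\Big\{\Big|\sum_{j=1}^{n} a_j x_j - f\Big|^2\Big\}.
\end{aligned} \tag{3.13}$$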


This equality shows that the problem of minimizing the left side of (3.13),
the problem of least squares approximation, is solved by setting $f$ equal to
the conditional expectation in (3.12), and that this solution is uniquely
determined aside from the usual ambiguity in the definition of a
conditional expectation. The Gaussian character of the distributions made it
possible to infer independence from zero correlation. If we only suppose
that $x_1, \dots, x_n, y$ are any random variables with

$$E\{|x_j|^2\} < \infty, \qquad E\{|y|^2\} < \infty,$$

the difference (3.10') with the $a_j$'s calculated in the same way is still
orthogonal to $x_1, \dots, x_n$, from which it follows only that (3.13) is true
if $f$ is a linear combination of the $x_j$'s. Thus the random variable
$\sum_{j=1}^{n} a_j x_j$, which we shall denote by $\hat{E}\{y \mid x_1, \dots, x_n\}$, solves the problem of
minimizing the left side of (3.13) for all linear combinations $f$ of the $x_j$'s,
the problem of linear least squares approximation.
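
In the real case the coefficients $a_j$ are determined by the orthogonality of the difference (3.10') to each $x_k$, that is, by the linear equations $\sum_j a_j E\{x_j x_k\} = E\{y x_k\}$. The following sketch solves these equations on a hypothetical four-point sample space and checks the decomposition (3.13) for one linear $f$. The data and the function names are assumptions made purely for illustration.

```python
def expect(values, probs):
    """Expectation of a random variable given as a list of values."""
    return sum(p * v for p, v in zip(probs, values))


def solve(a, b):
    """Gaussian elimination with partial pivoting for a small system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[piv] = m[piv], m[col]
        for i in range(col + 1, n):
            f = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= f * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def linear_predictor(xs, y, probs):
    """Coefficients a_j making y - sum_j a_j x_j orthogonal to every x_k."""
    gram = [[expect([u * v for u, v in zip(xj, xk)], probs) for xk in xs]
            for xj in xs]
    rhs = [expect([u * v for u, v in zip(y, xk)], probs) for xk in xs]
    return solve(gram, rhs)


# Hypothetical data: four equally likely outcomes.
probs = [0.25] * 4
xs = [[1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0, 0.0]]
y = [1.0, 4.0, 9.0, 16.0]

coeffs = linear_predictor(xs, y, probs)
fit = [sum(a * xj[w] for a, xj in zip(coeffs, xs)) for w in range(4)]
residual = [yw - fw for yw, fw in zip(y, fit)]

# Decomposition (3.13) for the particular linear choice f = x_1.
f = xs[0]
lhs = expect([(yw - fw) ** 2 for yw, fw in zip(y, f)], probs)
rhs = (expect([r * r for r in residual], probs)
       + expect([(a - b) ** 2 for a, b in zip(fit, f)], probs))
```

Since nothing here is Gaussian, the sketch illustrates only the wide sense statement: `fit` minimizes the mean square error among linear combinations of the $x_j$'s, not among all functions of them.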

It will be necessary to extend the above discussion to the case of 
infinitely many conditioning variables in later chapters. 

Many concepts which are used in the theory of stochastic processes can 
be formulated in two ways, in a “strict sense” or in a “wide sense.” The 
general principle is the following: Suppose a $y_t$ process has a certain
property $P$ expressed in terms of variances and covariances. Suppose
the corresponding complex Gaussian process given by Theorem 3.2
satisfying (3.7) and (3.8) has the corresponding but stronger property $P'$.
Then $P'$ is called a strict sense property and $P$ a wide sense property. (If
the $y_t$ process is real, the corresponding Gaussian process can be taken
as the uniquely defined corresponding real Gaussian process given by
Theorem 3.2, satisfying (3.7).) The point is that a wide sense property
implies much more when it is also supposed that the underlying distribu- 
tions are Gaussian. We shall see that a great many theorems have strict 
sense and wide sense versions, and that the distinction serves to organize 
and clarify the theory of stochastic processes. 

Example 1 Let $y_1, y_2, \dots$ be mutually orthogonal random variables.
We take this as a wide sense characterization, and find the corresponding
strict sense characterization by noting that, if an $x_t$ process is determined
as in Theorem 3.2, the $x_j$'s will be mutually independent. Thus a process
with mutually orthogonal random variables (or mutually uncorrelated
random variables if $y_j - E\{y_j\}$ is considered instead of $y_j$) is a process
with independent random variables in the wide sense.

Example 2 The least squares analysis made above shows that the best
linear least squares approximation $\hat{E}\{y \mid x_1, \dots, x_n\}$ is the wide sense
version of the best least squares approximation. We carry the analysis
further by evaluating the best least squares approximation in the
non-Gaussian case. Suppose then that $x_1, \dots, x_n, y$ are any random variables
with $E\{|y|^2\} < \infty$, and define

$$y_0 = E\{y \mid x_1, \dots, x_n\}.$$

Then

$$E\{|y_0|^2\} \le E\{E\{|y|^2 \mid x_1, \dots, x_n\}\} = E\{|y|^2\},$$

and it follows from I, Theorem 8.3, that $y - y_0$ is orthogonal to every
function $f$ which is measurable on the sample space of $x_1, \dots, x_n$ and
whose square is integrable. Hence

$$E\{|y - f|^2\} = E\{|(y - y_0) + (y_0 - f)|^2\} = E\{|y - y_0|^2\} + E\{|y_0 - f|^2\}, \tag{3.13'}$$

so that the problem of minimizing the left side of (3.13') is solved by
setting $f = y_0$. Thus the strict sense version of $\hat{E}\{y \mid x_1, \dots, x_n\}$ is
$E\{y \mid x_1, \dots, x_n\}$. The latter symbol is of course defined for any family
of approximating random variables, finite or infinite, and solves the 
corresponding least squares approximation problem, since (3.13’) needs 
no change in the general case. The symbol $\hat{E}\{y \mid \cdot\}$ will be defined for
infinite families of approximating random variables in IV. It obeys the
same rules of combination as the symbol $E\{y \mid \cdot\}$ since the two become
the same in the real Gaussian case (and in the complex Gaussian case if 
the random variables all satisfy the condition that the expected value of 
the product of any pair vanishes). 

Aside from theorems which are strict sense versions of wide sense 
theorems, very few facts specifically true of Gaussian processes are known. 
For this reason no later chapter will be devoted to Gaussian processes 
as such, although many types of Gaussian processes will be treated. 


4. Processes with mutually independent random variables 


These processes are discussed only in the discrete parameter case, since
the sample functions in the continuous parameter case are too irregular 
to arise in practice. The processes under discussion are thus simply 
sequences of mutually independent random variables. Most studies of 
the “foundations of probability,” which commonly identify the mathe- 
matical with the philosophical foundations, to the detriment of the former, 
have as their purpose the study of a sequence of mutually independent 
random variables with a common distribution function. 

We have already remarked that, if $F_1, F_2, \dots$ is any sequence of
distribution functions, there is a sequence of mutually independent random
variables with those distribution functions. Because of the independence
hypothesis, the joint distribution functions of the $x_j$'s are simply products
of the $F_j$'s. It is quite feasible to study these processes without mentioning
the word probability. For example, if $s_n = x_1 + \cdots + x_n$, the
distribution function of $s_n$ is the convolution of those of $x_1, \dots, x_n$, that is,

$$P\{s_n(\omega) \le \lambda\} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} F_1(\lambda - \xi_2 - \cdots - \xi_n)\, dF_2(\xi_2) \cdots dF_n(\xi_n). \tag{4.1}$$




The study of many of the properties of the partial sums of the series
$\sum_{1}^{\infty} x_j$ can thus be carried on as a study of iterated convolutions, with no
mention of probability. To prove (4.1) we observe that, if $x_1, \dots, x_n$
are the coordinate functions of the $n$-dimensional space of points
$(\xi_1, \dots, \xi_n)$, the probabilities of $\omega$ sets are probabilities of $(\xi_1, \dots, \xi_n)$
sets, with

$$P\{A\} = \int \cdots \int_A dF_1(\xi_1) \cdots dF_n(\xi_n).$$

Then (4.1) is simply the evaluation by iterated integration of $P\{A\}$ for $A$
defined by the inequality $\sum_{1}^{n} \xi_j \le \lambda$. Finally, using the representation
theory of I §6, it is no restriction in proving (4.1) to assume that the $x_j$'s
are the stated coordinate functions. The proof given also proves the
following slightly more general equation, which we shall have occasion
to use:

$$P\{x_j(\omega) \le \lambda_j,\ j = 1, \dots, n-1,\ s_n(\omega) \le \lambda\}
= \int_{-\infty}^{\lambda_1} \cdots \int_{-\infty}^{\lambda_{n-1}} F_n(\lambda - \xi_1 - \cdots - \xi_{n-1})\, dF_1(\xi_1) \cdots dF_{n-1}(\xi_{n-1}). \tag{4.2}$$
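
For integer-valued variables the iterated convolution is a finite sum, and the convolution formula (4.1) can be checked exactly. The sketch below is a minimal illustration, assuming fair six-sided dice as the summands and using exact rational arithmetic. The names `convolve`, `cdf` and `cdf_by_4_1` are hypothetical.

```python
from fractions import Fraction


def convolve(p, q):
    """Distribution of the sum of two independent integer-valued variables."""
    out = {}
    for u, pu in p.items():
        for v, qv in q.items():
            out[u + v] = out.get(u + v, Fraction(0)) + pu * qv
    return out


def cdf(p, lam):
    """Distribution function F(lam) = P{x <= lam}."""
    return sum((w for v, w in p.items() if v <= lam), Fraction(0))


die = {k: Fraction(1, 6) for k in range(1, 7)}
two = convolve(die, die)      # distribution of s_2
three = convolve(two, die)    # distribution of s_3, an iterated convolution


def cdf_by_4_1(lam):
    """(4.1) with n = 2: sum over xi_2 of F_1(lam - xi_2) * P{x_2 = xi_2}."""
    return sum((cdf(die, lam - xi) * w for xi, w in die.items()), Fraction(0))
```

Comparing `cdf(two, lam)` with `cdf_by_4_1(lam)` for each integer `lam` confirms that the distribution function of the sum is the convolution of the individual distribution functions, with no reference to the underlying sample space.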


5. Processes with uncorrelated or orthogonal random variables 


A process with uncorrelated random variables is one whose random 
variables are uncorrelated in pairs, that is, it is supposed that 


$$E\{|x_t|^2\} < \infty \tag{5.1}$$

and that

$$E\{x_s \bar{x}_t\} = E\{x_s\}E\{\bar{x}_t\}, \qquad s \ne t. \tag{5.2}$$

A closely related type of process is a process with orthogonal random
variables, that is, one for which (5.1) is true and for which

$$E\{x_s \bar{x}_t\} = 0, \qquad s \ne t. \tag{5.3}$$

If an $x_t$ process is a process with uncorrelated random variables, the
$y_t$ process determined by

$$y_t = x_t - E\{x_t\}$$

is a process with orthogonal random variables. Thus, if $x$ is a real random
variable, uniformly distributed in the interval $(0, 2\pi)$, the random variables

$$\sin x, \ \sin 2x, \ \dots$$

constitute both a process with uncorrelated random variables and a
process with orthogonal random variables (with zero means). On the
other hand, the random variables

$$1 + \sin x, \ 2 + \sin 2x, \ \dots$$

constitute a process with uncorrelated random variables but not one with
orthogonal random variables.
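
The two trigonometric examples can be checked by numerical integration. The sketch below assumes that $x$ is uniformly distributed on $(0, 2\pi)$ and approximates expectations by an equally spaced midpoint rule. The helper names and the number of nodes are arbitrary choices made for illustration.

```python
import math


def mean(f, n=4096):
    """Approximate E{f(x)} for x uniform on (0, 2*pi) by the midpoint rule."""
    h = 2 * math.pi / n
    return sum(f((k + 0.5) * h) for k in range(n)) / n


def inner(f, g):
    """E{f(x) g(x)}; the variables are real, so no conjugate is needed."""
    return mean(lambda x: f(x) * g(x))


def cov(f, g):
    """E{fg} - E{f}E{g}."""
    return inner(f, g) - mean(f) * mean(g)


def s1(x):
    return math.sin(x)


def s2(x):
    return math.sin(2 * x)


def u1(x):
    return 1 + math.sin(x)


def u2(x):
    return 2 + math.sin(2 * x)


print(inner(s1, s2), cov(s1, s2))   # both vanish up to rounding
print(cov(u1, u2), inner(u1, u2))   # covariance vanishes, inner product does not
```

Adding constants leaves the covariance unchanged but shifts the inner product by the product of the means, which is why the second sequence is uncorrelated without being orthogonal.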

As already remarked in the introduction, the theory of probability is 
but one aspect of the theory of measure. This is particularly clearly 
exhibited at the present stage. The theory of orthogonal functions and 
series has occupied mathematicians for over 100 years, but has never 
been considered a part of probability. From the present point of view, 
a sequence of orthogonal functions is a sequence of orthogonal random 
variables, and theorems like the measure theoretic Riesz-Fischer theorem 
become probability theorems. The probability approach, however, 
imposes the rather pointless condition that the basic measure space have 
total measure 1. Up to the present time this condition has been imposed 
by the physical interpretation of probability, although many of the ideas 
and definitions used in probability theory do not intrinsically presuppose 
the condition. Certain physical situations (such as a free particle in 
quantum mechanics which is equidistributed throughout space) even 
suggest the usefulness of a probability measure with a value $+\infty$ for the
whole space. For these reasons and because of later applications in this 
book, the hypothesis of finiteness of measure is dropped in IV, where 
orthogonal series are discussed, and to avoid confusion the customary 
language of measure theory is used. 

As already noted in §3, the processes of this section are the wide sense 
versions of processes with independent random variables. Although this 
terminology is not ordinarily used, the point of view will prove 
illuminating. 


6. Markov processes 


(a) Strict sense In the following we shall consider only real processes; 
the changes to be made in going to the complex case will be obvious. A 
(strict sense) Markov process is a process $\{x_t, t \in T\}$ satisfying the following
condition: for any integer $n > 1$, if $t_1 < \cdots < t_n$ are parameter values,
the conditional $x_{t_n}$ probabilities relative to $x_{t_1}, \dots, x_{t_{n-1}}$ are the same as
those relative to $x_{t_{n-1}}$ in the sense that for each $\lambda$

$$P\{x_{t_n}(\omega) \le \lambda \mid x_{t_1}, \dots, x_{t_{n-1}}\} = P\{x_{t_n}(\omega) \le \lambda \mid x_{t_{n-1}}\} \tag{6.1}$$

with probability 1. Then, if $T_1 \subset T$, the process $\{x_t, t \in T_1\}$ is also a


Markov process. When a process is called a Markov process, “strict 
sense” is always understood. 




Condition (6.1) is at first glance stronger than
$$P\{x_{t_n}(\omega) \le \lambda \mid x_{t_1}, \dots, x_{t_{n-1}}\} = f(x_{t_{n-1}}) \tag{6.1'}$$

(to hold with probability 1), where the random variable on the right is
a Baire function of $x_{t_{n-1}}$, or more generally is measurable on the sample
space of $x_{t_{n-1}}$. Actually, however, $f$ is then necessarily given by the right
side of (6.1). In fact, if the operation $E\{\,\cdot \mid x_{t_{n-1}}\}$ is performed on both
sides of (6.1'), the right side is unchanged, and the left side becomes the
right side of (6.1), by I (10.9). Because of the equivalence of (6.1) and
(6.1'), a Markov process is sometimes loosely described as one in which
the conditional probability on the left in (6.1) depends only on $x_{t_{n-1}}(\omega)$.
The condition (6.1) is also equivalent to the condition that, if $s_1 < s_2$
and if $\lambda$ is arbitrary, then

$$P\{x_{s_2}(\omega) \le \lambda \mid x_t, t \le s_1\} = P\{x_{s_2}(\omega) \le \lambda \mid x_{s_1}\} \tag{6.2}$$

(also with probability 1). In fact, if (6.2) is true with probability 1, and
if $t_1 < \cdots < t_{n-1} = s_1 < t_n = s_2$, we can perform the operation
$E\{\,\cdot \mid x_{t_1}, \dots, x_{t_{n-1}}\}$ to both sides of (6.2) to obtain (6.1) (with probability
1). Conversely, if (6.1) is true with probability 1, then

$$\int_\Lambda P\{x_{s_2}(\omega) \le \lambda \mid x_{s_1}\}\, dP = P\{x_{s_2}(\omega) \le \lambda,\ \omega \in \Lambda\}$$

whenever $\Lambda$ is measurable on the sample space of finitely many $x_t$'s with
$t \le s_1$, and we have seen in I §7 that it is sufficient to verify this equality
for these sets $\Lambda$ to be able to identify the integrand with a version of the
left side of (6.2).

We shall now prove that, if $\{x_t, t \in T\}$ is a Markov process, if $s \in T$, if
$y$ is an $\omega$ function measurable on the sample space of the $x_t$'s with $t \ge s$,
and if $E\{|y|\} < \infty$, then

$$E\{y \mid x_t, t \le s\} = E\{y \mid x_s\} \tag{6.3}$$

with probability 1. This equality contains (6.2) as a special case, and
thus, as we have seen, implies (6.1). We shall call the validity of (6.3)
the Markov property. The Markov property will be derived first under
the assumption that $T$ contains only finitely many points $> s$. Let these
points be $s = u_0 < \cdots < u_m$. Let $\lambda_0, \dots, \lambda_m$ be arbitrary
numbers. Then, if $y$ is defined by

$$y(\omega) = \begin{cases} 1 & \text{if } x_{u_j}(\omega) \le \lambda_j,\ j = 0, \dots, m,\\ 0 & \text{otherwise,} \end{cases}$$

(6.3) reduces to

$$P\{x_{u_j}(\omega) \le \lambda_j,\ j = 0, \dots, m \mid x_t, t \le s\} = P\{x_{u_j}(\omega) \le \lambda_j,\ j = 0, \dots, m \mid x_s\}. \tag{6.3'}$$



Conversely, if (6.3') is true with probability 1 for arbitrary $\lambda_0, \dots, \lambda_m$,
then (6.3) holds with probability 1 if $y$ is measurable on the sample space
of $x_{u_0}, \dots, x_{u_m}$ and if $E\{|y|\} < \infty$. In fact, under the stated hypothesis,
if the $\omega$ set $M$ is defined by

$$M = \{[x_{u_0}(\omega), \dots, x_{u_m}(\omega)] \in A\},$$

it follows trivially that

$$P\{M \mid x_t, t \le s\} = P\{M \mid x_s\} \tag{6.4}$$

with probability 1, if $A$ is an $(m+1)$-dimensional right semi-closed interval.
Since the class of sets $M$ of this type, together with the $\omega$ sets of probability
0, generates the class of $\omega$ sets measurable on the sample space of
$x_{u_0}, \dots, x_{u_m}$, it follows (I, Theorem 9.6) that (6.3) is true with probability
1 whenever $y$ is as described. We now proceed to prove (6.3) by induction
in $m$. If $m = 0$, (6.3) is trivial, since both sides of the equation reduce
to $y$, with probability 1. If $m = 1$, (6.3'), and therefore (6.3), is true
with probability 1 because (with probability 1)

$$P\{x_s(\omega) \le \lambda_0,\ x_{u_1}(\omega) \le \lambda_1 \mid x_t, t \le s\}
= P\{x_s(\omega) \le \lambda_0,\ x_{u_1}(\omega) \le \lambda_1 \mid x_s\} = 0 \qquad \text{if } x_s(\omega) > \lambda_0$$

and, using (6.2),

$$\begin{aligned}
P\{x_s(\omega) \le \lambda_0,\ x_{u_1}(\omega) \le \lambda_1 \mid x_t, t \le s\}
&= P\{x_{u_1}(\omega) \le \lambda_1 \mid x_t, t \le s\}\\
&= P\{x_{u_1}(\omega) \le \lambda_1 \mid x_s\}\\
&= P\{x_s(\omega) \le \lambda_0,\ x_{u_1}(\omega) \le \lambda_1 \mid x_s\} \qquad \text{if } x_s(\omega) \le \lambda_0.
\end{aligned}$$

Now assume that (6.3') is true with probability 1 for some $m \ge 1$. We
then show that it is true with probability 1 for $m$ replaced by $m + 1$.
Let $\lambda_0, \dots, \lambda_{m+1}$ be any real numbers, and define $y$, $z$ by

$$y(\omega) = \begin{cases} 1 & \text{if } x_s(\omega) \le \lambda_0,\\ 0 & \text{otherwise;} \end{cases}
\qquad
z(\omega) = \begin{cases} 1 & \text{if } x_{u_j}(\omega) \le \lambda_j,\ j = 1, \dots, m+1,\\ 0 & \text{otherwise.} \end{cases}$$

Then, applying the induction hypothesis to the $x_t$ process, and also to
this process with all parameter points $< u_1$ deleted except the point $s = u_0$,

$$E\{z \mid x_t, t \le u_1\} = E\{z \mid x_{u_1}\} = E\{z \mid x_s, x_{u_1}\} \tag{6.5}$$




with probability 1. Now

$$\begin{aligned}
E\{y\,E\{z \mid x_t, t \le u_1\} \mid x_t, t \le s\}
&= E\{E\{yz \mid x_t, t \le u_1\} \mid x_t, t \le s\}\\
&= E\{yz \mid x_t, t \le s\}\\
&= P\{x_{u_j}(\omega) \le \lambda_j,\ j = 0, \dots, m+1 \mid x_t, t \le s\}
\end{aligned} \tag{6.6}$$

with probability 1, and, using the induction hypothesis with $m = 1$,

$$\begin{aligned}
E\{y\,E\{z \mid x_s, x_{u_1}\} \mid x_t, t \le s\}
&= E\{E\{yz \mid x_s, x_{u_1}\} \mid x_t, t \le s\}\\
&= E\{E\{yz \mid x_s, x_{u_1}\} \mid x_s\}\\
&= E\{yz \mid x_s\}\\
&= P\{x_{u_j}(\omega) \le \lambda_j,\ j = 0, \dots, m+1 \mid x_s\}
\end{aligned} \tag{6.7}$$

with probability 1. Combining (6.6) and (6.7) with (6.5), we obtain
(6.3'), and hence also (6.3), with $m$ replaced by $m + 1$. Finally,
suppose that $T$ contains infinitely many points $> s$. Then the result just
obtained implies that (6.4) is true with probability 1 if $M$ is measurable
on the sample space of some finite aggregate of $x_t$'s with $t \ge s$. It then
follows (I, Theorem 9.6) that (6.3) is true with probability 1 if $E\{|y|\} < \infty$,
and if $y$ is equal with probability 1 to an $\omega$ function measurable with
respect to the Borel field of $\omega$ sets generated by the class of sets $M$, that
is, if $y$ is measurable on the sample space of the $x_t$'s with $t \ge s$, as was
to be proved.

The definition of a Markov process given above is one-sided in $t$. An
alternative definition, stated in the form of a necessary and sufficient
condition, is the following, which, because it is two-sided, shows that, if
the process $\{x_t, t \in T\}$ is a Markov process, that with variables $\{x_{-t}\}$ is
also a Markov process. A process is a Markov process if and only if,
for any positive integers $m$, $n$, and real numbers $\lambda_1, \dots, \lambda_m$, $\mu_1, \dots, \mu_n$,
if $s_1 < \cdots < s_m < t < t_1 < \cdots < t_n$ are parameter values, then

$$\begin{aligned}
&P\{x_{s_j}(\omega) \le \lambda_j,\ j = 1, \dots, m;\ x_{t_k}(\omega) \le \mu_k,\ k = 1, \dots, n \mid x_t\}\\
&\qquad = P\{x_{s_j}(\omega) \le \lambda_j,\ j = 1, \dots, m \mid x_t\}\; P\{x_{t_k}(\omega) \le \mu_k,\ k = 1, \dots, n \mid x_t\}
\end{aligned} \tag{6.8}$$

with probability 1. It is tempting to restate (6.8) by saying that for fixed
$x_t(\omega)$ the sets of random variables $x_{s_1}, \dots, x_{s_m}$ and $x_{t_1}, \dots, x_{t_n}$ are
mutually independent; that is, for $x_t(\omega)$ (the present) known, the past and
future are independent of each other. However, the ambiguity in the
definition of conditional probabilities makes the interpretation of such
statements rather delicate. The following is an intuitive and suggestive
proof of the equivalence of (6.8) and the Markov property, in the spirit
of the time interpretation of (6.8). A rigorous proof will also be given.
For given $x_t(\omega)$, (6.8) states that a certain past event and a certain future
event are mutually independent, in view of the multiplying of the
corresponding probabilities (all conditioned by the knowledge of the present).
An alternative condition for independence is that certain conditional
probabilities do not actually involve the conditioning variables;
specifically, if $P^{(t)}$ denotes probabilities conditioned by $x_t$, this version of the
independence condition becomes

$$P^{(t)}\{x_{t_k}(\omega) \le \mu_k,\ k = 1, \dots, n \mid x_{s_1}, \dots, x_{s_m}\}
= P^{(t)}\{x_{t_k}(\omega) \le \mu_k,\ k = 1, \dots, n\} \tag{6.8'}$$

with $P$ probability 1. We have seen in I §10 that this equation can be
written in the form

$$P\{x_{t_k}(\omega) \le \mu_k,\ k = 1, \dots, n \mid x_{s_1}, \dots, x_{s_m}, x_t\}
= P\{x_{t_k}(\omega) \le \mu_k,\ k = 1, \dots, n \mid x_t\}. \tag{6.8''}$$

If $n = 1$, this equation reduces to (6.1); for arbitrary $n$ it is still a special
case of the Markov property.


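The following is a rigorous discussion of the condition (6.8). Define $y$,
$z$ by

$$y(\omega) = \begin{cases} 1 & \text{if } x_{s_j}(\omega) \le \lambda_j,\ j = 1, \dots, m,\\ 0 & \text{otherwise;} \end{cases}
\qquad
z(\omega) = \begin{cases} 1 & \text{if } x_{t_k}(\omega) \le \mu_k,\ k = 1, \dots, n,\\ 0 & \text{otherwise.} \end{cases}$$

Then, using I, Theorem 8.3, we find that

$$E\{y \mid x_t\}\,E\{z \mid x_t\} = E\{y\,E\{z \mid x_t\} \mid x_t\} \tag{6.9}$$

with probability 1. If the process is a Markov process, the right side
becomes, using I, Theorem 8.3, once more,

$$E\{y\,E\{z \mid x_{s_1}, \dots, x_{s_m}, x_t\} \mid x_t\} = E\{E\{yz \mid x_{s_1}, \dots, x_{s_m}, x_t\} \mid x_t\} = E\{yz \mid x_t\} \tag{6.10}$$

with probability 1. Combining (6.9) and (6.10), we find that

$$E\{y \mid x_t\}\,E\{z \mid x_t\} = E\{yz \mid x_t\} \tag{6.11}$$

with probability 1, and this is precisely (6.8). Conversely, if (6.11) is
true with probability 1, it follows that

$$E\{yz \mid x_t\} = E\{y \mid x_t\}\,E\{z \mid x_t\} = E\{y\,E\{z \mid x_t\} \mid x_t\}$$

with probability 1, so that

$$\int_\Lambda yz\, dP = \int_\Lambda y\,E\{z \mid x_t\}\, dP$$

if $\Lambda$ is a set measurable on the sample space of $x_t$. But then

$$\int_{\Lambda \cap \{y(\omega) = 1\}} z\, dP = \int_{\Lambda \cap \{y(\omega) = 1\}} E\{z \mid x_t\}\, dP,$$

so that

$$\int_M z\, dP = \int_M E\{z \mid x_t\}\, dP \tag{6.12}$$

if $M$ is an $\omega$ set of the form $\{x_{s_j}(\omega) \le \lambda_j,\ j = 1, \dots, m,\ x_t(\omega) \le \lambda\}$, and
we have seen in I §7 that equality in (6.12) for these sets $M$ is sufficient
to identify the integrand on the right with $E\{z \mid x_{s_1}, \dots, x_{s_m}, x_t\}$, that is,

$$P\{x_{t_k}(\omega) \le \mu_k,\ k = 1, \dots, n \mid x_{s_1}, \dots, x_{s_m}, x_t\}
= P\{x_{t_k}(\omega) \le \mu_k,\ k = 1, \dots, n \mid x_t\}$$

with probability 1. This is (6.1) in a slightly different notation.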
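
For a process taking finitely many values the factorization (6.8) can be verified by direct enumeration, since conditional probabilities given an event of positive probability are ordinary ratios. The sketch below assumes a hypothetical three-state chain observed at three times, with an initial distribution `p0` and a one-step transition matrix `T` chosen arbitrarily, and uses exact rational arithmetic.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical initial distribution and one-step transition matrix.
p0 = [F(1, 2), F(1, 3), F(1, 6)]
T = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(0), F(1, 2), F(1, 2)]]
STATES = range(3)


def joint(a, b, c):
    """P{x_1 = a, x_2 = b, x_3 = c} for the chain defined by p0 and T."""
    return p0[a] * T[a][b] * T[b][c]


def past_and_future_independent():
    """Check the analogue of (6.8): given x_2, x_1 and x_3 are independent."""
    for b in STATES:
        pb = sum(joint(a, b, c) for a in STATES for c in STATES)
        for a, c in product(STATES, STATES):
            both = joint(a, b, c) / pb
            past = sum(joint(a, b, k) for k in STATES) / pb
            future = sum(joint(k, b, c) for k in STATES) / pb
            if both != past * future:
                return False
    return True


print(past_and_future_independent())
```

In this discrete setting every present value has positive probability, so the delicacy mentioned in the text about versions of conditional probabilities does not arise.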

Example 1 Let $y_1, y_2, \dots$ be mutually independent random variables.
Then the process $\{y_n, n \ge 1\}$ is a Markov process, and in fact $P\{y_n(\omega) \le \lambda\}$
is one version of both $P\{y_n(\omega) \le \lambda \mid y_1, \dots, y_{n-1}\}$ and $P\{y_n(\omega) \le \lambda \mid y_{n-1}\}$.
Thus in this case the conditional distribution of $y_n$ relative to $y_1, \dots, y_{n-1}$
exists in the technical sense of I §9.

Example 2 Let the $y_j$'s be as in Example 1, and define $x_n = \sum_{1}^{n} y_j$.
Let $F_{st}$ be the distribution function of $\sum_{s+1}^{t} y_j$. Then the $x_n$ process is
a Markov process, and in fact, if $s < t$, one version of both the conditional
distribution of $x_t$ relative to $x_s$ and that relative to $x_1, \dots, x_s$ is given by

$$P\{x_t(\omega) \le \lambda \mid x_1, \dots, x_s\} = F_{st}(\lambda - x_s).$$

This is intuitively clear, but it may be instructive to give a formal proof.
We must show that, if $\Lambda$ is an $\omega$ set measurable on the sample space of
$x_1, \dots, x_s$ or, which amounts to the same thing, measurable on the
sample space of $y_1, \dots, y_s$, then

$$\int_\Lambda F_{st}(\lambda - x_s)\, dP = P\{\omega \in \Lambda,\ x_t(\omega) \le \lambda\}. \tag{6.13}$$




To derive this equation it will be convenient to assume that the $s + 1$
random variables

$$y_1, \ \dots, \ y_s, \ \sum_{s+1}^{t} y_j$$

are the coordinate functions in the $(s+1)$-dimensional space of points
$(\eta_1, \dots, \eta_{s+1})$, with

$$P\{(\eta_1, \dots, \eta_{s+1}) \in A\} = \int \cdots \int_A dF_1(\eta_1) \cdots dF_s(\eta_s)\, dF_{st}(\eta_{s+1}),$$

where

$$F_j(\lambda) = P\{y_j(\omega) \le \lambda\}, \qquad j \le s,$$

for every $(s+1)$-dimensional Borel set $A$. According to the representation
theory of I §6 it is no restriction to make this assumption. Then,
translating (6.13), it is sufficient to prove that

$$\int \cdots \int_A F_{st}\Big(\lambda - \sum_{j=1}^{s} \eta_j\Big)\, dF_1(\eta_1) \cdots dF_s(\eta_s)
= P\{(\eta_1, \dots, \eta_s) \in A,\ x_t(\omega) \le \lambda\}$$

for $A$ an $s$-dimensional Borel set in $(\eta_1, \dots, \eta_s)$ space. It is sufficient
to prove the equality for $A$ an $s$-dimensional interval determined by the
inequalities

$$-\infty < \eta_j \le \lambda_j, \qquad j \le s,$$

and the equality then becomes a special case of (4.2).

Example 3 Let $P_1, P_2, \dots$ be functions of $\xi$, $A$, where $\xi$ is a real
number and $A$ a linear Borel set, with the following properties:

(i) $P_j(\xi, A)$ defines a probability measure of $A$ for fixed $\xi$;

(ii) $P_j(\xi, A)$ defines a Baire function of $\xi$ for fixed $A$.

Let $P$ be a probability measure of linear Borel sets. Then there is a
Markov process $\{x_n, n \ge 1\}$, such that $P_n(x_n, A)$ is one version of
$P\{x_{n+1}(\omega) \in A \mid x_n\}$ and that

$$P\{x_1(\omega) \in A\} = P(A).$$

To see this we observe first that, for every $m$, with the obvious conventions,

$$\int \cdots \int_{A_m} P(d\xi_1)\, P_1(\xi_1, d\xi_2) \cdots P_{m-1}(\xi_{m-1}, d\xi_m)$$

defines a probability measure of $m$-dimensional Borel sets $A_m$ in the space
of points $(\xi_1, \dots, \xi_m)$. We define a sequence of random variables
$\{x_n, n \ge 1\}$ such that the distribution of $x_1, \dots, x_m$ is given by the above
multiple integral. This is possible, with the $x_j$'s the coordinate functions
of infinite dimensional space, if the above measures are mutually
consistent (see I §5), and it is trivial to verify that they are. It is then also
trivial to verify that the $x_n$ process obtained in this way is a Markov
process, with

$$P\{x_{n+1}(\omega) \in A \mid x_1, \dots, x_n\} = P_n(x_n, A)$$

with probability 1. In particular, if $P$ and the $P_j$'s are given by densities,
so that

$$P(A) = \int_A p(\eta)\, d\eta, \qquad P_j(\xi, A) = \int_A p_j(\xi, \eta)\, d\eta,$$

the distribution of $x_1, \dots, x_m$ has density

$$p(\xi_1)\, p_1(\xi_1, \xi_2) \cdots p_{m-1}(\xi_{m-1}, \xi_m).$$

In this form the significance of the Markov property lies in the fact that
$p_j$ does not involve $\xi_1, \dots, \xi_{j-1}$. We have seen that the $x_n$ process
reversed in time is also a Markov process. In the density case just
described, the reversed transition probabilities can be exhibited directly
in an elegant form. The conditional distribution of $x_n$ for $x_{n+1}(\omega) = \xi_{n+1}$
has density (in $\xi_n$)

$$\frac{\displaystyle \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(\xi_1)\, p_1(\xi_1, \xi_2) \cdots p_n(\xi_n, \xi_{n+1})\, d\xi_1 \cdots d\xi_{n-1}}
{\displaystyle \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(\xi_1)\, p_1(\xi_1, \xi_2) \cdots p_n(\xi_n, \xi_{n+1})\, d\xi_1 \cdots d\xi_n}. \tag{6.14}$$

(Here we have assumed that the $p_j$'s are Baire functions of their pairs of
variables. Actually it is easily seen that these densities can always be
chosen to have this property.) Note that the forward transition
probability density function $p_j$ can be chosen quite independently of the initial
probability density function $p$, so that a change of $p$ does not force a
change of $p_j$. Once the $p_j$'s are chosen, however, the reverse transition
probability density function (6.14) will in general depend on the choice of $p$.
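
The dependence of the reversed transition probabilities on the initial distribution is easy to see in a discrete analogue of (6.14), in which densities are replaced by probabilities and integrals by sums. The sketch below is an illustration only: the two-state transition matrix `T` and the two initial distributions are hypothetical, and only one step ($n = 1$) is reversed.

```python
from fractions import Fraction as F

# Hypothetical forward transition matrix, T[i][j] = P{x_2 = j | x_1 = i}.
T = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]


def reverse_transition(p, t):
    """rev[j][i] = P{x_1 = i | x_2 = j}: discrete analogue of (6.14), n = 1."""
    n = len(p)
    nxt = [sum(p[i] * t[i][j] for i in range(n)) for j in range(n)]
    return [[p[i] * t[i][j] / nxt[j] for i in range(n)] for j in range(n)]


rev_a = reverse_transition([F(1, 2), F(1, 2)], T)
rev_b = reverse_transition([F(9, 10), F(1, 10)], T)
print(rev_a[0], rev_b[0])
```

The forward matrix `T` is the same in both calls, yet the two reversed matrices differ, in agreement with the remark that (6.14) depends on the choice of $p$.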

Now let $\{x_n, n \ge 1\}$ be any Markov process with the indicated
parameter set. We have seen in I §9 that there is a function $P_n$ of a linear
Borel set $A$ and a real variable, the wide sense conditional distribution
of $x_{n+1}$ relative to $x_n$, which determines a Baire function when $A$ is fixed
and determines a probability measure when the variable is fixed, such that

$$P\{x_{n+1}(\omega) \in A \mid x_n\} = P_n(x_n, A)$$

with probability 1, for each $n$, $A$. Let $P$ be the distribution of $x_1$,

$$P(A) = P\{x_1(\omega) \in A\}.$$

We show (cf. Example 3) that, if $x$ is a Baire function of $x_1, \dots, x_m$, then

$$E\{x\} = \int P(d\xi_1) \int P_1(\xi_1, d\xi_2) \cdots \int x\, P_{m-1}(\xi_{m-1}, d\xi_m), \tag{6.15}$$

where $x$ under the integral sign is evaluated at $(\xi_1, \dots, \xi_m)$.
Using the Markov property, the first integration on the right yields a
version of $E\{x \mid x_1, \dots, x_{m-1}\}$ which is a Baire function of $x_1, \dots, x_{m-1}$
according to our work in I §9. The next integration yields

$$E\{E\{x \mid x_1, \dots, x_{m-1}\} \mid x_1, \dots, x_{m-2}\} = E\{x \mid x_1, \dots, x_{m-2}\},$$

and proceeding in this way the last integral yields

$$E\{E\{x \mid x_1\}\} = E\{x\},$$

as was to be proved. Thus, if $P, P_1, P_2, \dots$ are used as in Example 3
to define a Markov process whose variables are coordinate variables, the
Markov process so obtained is a representation of the given process.

Now let $\{x_t, t \in T\}$ be any Markov process. Then, if $s < \tau < t$, the
transition probability satisfies the equation

$$P\{x_t(\omega) \in A \mid x_s\} = E\{P\{x_t(\omega) \in A \mid x_\tau\} \mid x_s\} \tag{6.16}$$

with probability 1. In fact, the conditional probability on the right is also

$$P\{x_t(\omega) \in A \mid x_s, x_\tau\}$$

in view of the Markov property, so that (6.16) is a consequence of the
general theorems on conditional expectations and their iterations; see
I (10.9). This equation is known as the Chapman-Kolmogorov equation,
or, in particular cases, as the Smoluchovski equation. Generalizing the
sequence of functions $P_1, P_2, \dots$ of the preceding paragraph, we now
observe that there is a function $P$ of $\xi$, $s$, $A$, $t$, with $s < t$, which determines
a probability measure of the linear Borel set $A$ when $\xi$, $s$, $t$ are fixed, and
determines a Baire function of $\xi$ when $s$, $A$, $t$ are fixed, such that

$$P(x_s, s; A, t) = P\{x_t(\omega) \in A \mid x_s\} \tag{6.17}$$

with probability 1, for each $s$, $A$, $t$. Then (6.16) can also be written in
the form

$$P(\xi, s; A, t) = \int_{-\infty}^{\infty} P(\zeta, \tau; A, t)\, P(\xi, s; d\zeta, \tau). \tag{6.18}$$

This equation holds for $\xi$ not in some Borel set $B$ (depending on $s$, $t$, $\tau$, $A$)
with

$$P\{x_s(\omega) \in B\} = 0.$$
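
When the variables take only finitely many values, the integral in (6.18) becomes a sum over the intermediate state. The sketch below checks this for a hypothetical two-state process with integer times 0, 1, 2, 3 and a different one-step transition matrix at each time. The matrices and the function names are assumptions made for illustration.

```python
from fractions import Fraction as F

# Hypothetical one-step transition matrices for times 0->1, 1->2, 2->3.
STEPS = [
    [[F(1, 2), F(1, 2)], [F(1, 3), F(2, 3)]],
    [[F(1, 4), F(3, 4)], [F(1), F(0)]],
    [[F(2, 5), F(3, 5)], [F(1, 2), F(1, 2)]],
]
STATES = (0, 1)


def matmul(a, b):
    """Product of two square matrices of Fractions."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition(s, t):
    """Matrix of P(i, s; {j}, t) for integer times s < t."""
    out = STEPS[s]
    for k in range(s + 1, t):
        out = matmul(out, STEPS[k])
    return out


def prob(i, s, target, t):
    """P(i, s; A, t): probability of being in the set A at time t."""
    row = transition(s, t)[i]
    return sum((row[j] for j in target), F(0))


def chapman_kolmogorov(i, s, tau, target, t):
    """Right side of (6.18), with the integral replaced by a sum over zeta."""
    return sum(prob(z, tau, target, t) * prob(i, s, {z}, tau) for z in STATES)


print(prob(0, 0, {1}, 3), chapman_kolmogorov(0, 0, 1, {1}, 3))
```

In this finite setting (6.18) is the associativity of matrix multiplication, which is why transition functions satisfying it identically can be specified through one-step matrices.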




In the applications, one is usually given not a Markov process but
transition probabilities in terms of which the process is to be constructed.
Specifically, it is usually supposed that $T$ has a minimum value $t_0$, and
that a function $P$ is given, satisfying (6.18) identically in the free variables.
If an initial distribution is given at $t_0$, a Markov process with the transition
probability function $P$ is defined as follows. If $t_0 < \cdots < t_n$ the
random variables $x_{t_0}, \ldots, x_{t_n}$ are to have a joint distribution determined
by the preassigned $x_{t_0}$ distribution and the preassigned transition
probabilities as in (6.15) (with x defined as 1 on an m-dimensional Borel set,
and 0 otherwise, that is, as in Example 3). We have already remarked
that it is then trivial to verify that this definition will make the random
variables $x_{t_0}, \ldots, x_{t_n}$ a Markov process. The distributions assigned in
this way are mutually consistent, that is, the Kolmogorov consistency
conditions are satisfied, so that there is actually a (Markov) process with
the assigned initial and transition probabilities (I §5). The transition
function $P$ is frequently supposed to be given by a density,

$P(\xi, s; A, t) = \int_A p(\xi, s; \eta, t)\, d\eta,$

and in this case (6.18) reduces to

(6.19)    $p(\xi, s; \eta, t) = \int_{-\infty}^{\infty} p(\zeta, \tau; \eta, t)\, p(\xi, s; \zeta, \tau)\, d\zeta.$

Equation (6.19) is frequently justified intuitively by the statement that
the probability of a transition from $\xi$ at time $s$ to $\eta$ at time $t$ is the
probability of a transition to $\zeta$ at the intermediate time $\tau$ multiplied by
the probability of a transition from $\zeta$ at $\tau$ to $\eta$ at $t$, summed over all
values of $\zeta$. There is nothing wrong with this statement, which is simply
an imprecise paraphrase of (6.19), but it is imprecise enough to have led
some unwary students to believe that (6.19) is true of all stochastic
processes. Note, however, that without the Markov property the first
factor under the integral sign in (6.19) would depend on $\xi$ and $s$, in general.
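As an illustration only, the discrete analogue of (6.19) can be checked exactly for a finite-state chain, where the integral over the intermediate state becomes a matrix product. The sketch below assumes a time-homogeneous two-state chain with a made-up one-step matrix `P1`; the names `mat_mul` and `transition` are hypothetical helpers, not notation from the text.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical one-step transition matrix of a two-state chain (assumption).
P1 = [[Fraction(9, 10), Fraction(1, 10)],
      [Fraction(1, 2), Fraction(1, 2)]]


def transition(steps):
    """Transition matrix over `steps` unit time steps (identity for 0 steps)."""
    n = len(P1)
    result = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, P1)
    return result


# Discrete form of (6.19) with s = 0, tau = 2, t = 5: the transition from
# time 0 to time 5 is the sum over the intermediate state at time 2.
lhs = transition(5)
rhs = mat_mul(transition(2), transition(3))
```

Exact rational arithmetic is used so that the two sides agree identically rather than up to rounding.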

A sequence of random variables $\{x_n\}$ is said to constitute a multiple
Markov process if there is an integer $\nu$ such that for each $A$ and each $n$

(6.1'')    $P\{x_n(\omega) \in A \mid x_1, \ldots, x_{n-1}\} = P\{x_n(\omega) \in A \mid x_{n-\nu}, \ldots, x_{n-1}\}$

with probability 1. If $\nu = 1$ the process is then a Markov process
(sometimes also called a simple Markov process). The generalization is
not very significant, because the (vector) process with random variables
$\{\hat{x}_n\}$, $\hat{x}_n = (x_n, \ldots, x_{n+\nu-1})$, has the Markov property [defined for vector
processes by the obvious modification of (6.1)]. Thus multiple Markov
processes can be reduced to simple ones at the small expense of going to
vector-valued random variables. In particular, in the important case of
random variables $\{x_n\}$ which take on only a fixed finite set of values
(Markov chains), the $\hat{x}_n$ process will be one of the same type; the variables
$\{\hat{x}_n\}$ need not be considered vector variables but simply variables taking
on $N^\nu$ values, if the $x_n$'s take on $N$ values.
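The reduction can be sketched concretely for a chain with $N = 2$ values and $\nu = 2$. The kernel `P_ONE` below is a made-up example, and `lifted_transition` is a hypothetical helper that builds the simple Markov chain on the $N^\nu = 4$ pairs $(x_n, x_{n+1})$.

```python
from fractions import Fraction
from itertools import product

STATES = (0, 1)   # N = 2 values
ORDER = 2         # nu = 2

# Hypothetical second-order kernel: P{x_{n+2} = 1 | x_n = a, x_{n+1} = b}.
P_ONE = {(0, 0): Fraction(1, 10), (0, 1): Fraction(1, 2),
         (1, 0): Fraction(1, 3), (1, 1): Fraction(9, 10)}


def lifted_transition():
    """Transition probabilities of the pair process (x_n, x_{n+1})."""
    pairs = list(product(STATES, repeat=ORDER))
    table = {}
    for (a, b) in pairs:
        for (c, d) in pairs:
            if c != b:
                prob = Fraction(0)          # the next pair must start with b
            elif d == 1:
                prob = P_ONE[(a, b)]
            else:
                prob = 1 - P_ONE[(a, b)]
            table[(a, b), (c, d)] = prob
    return pairs, table


pairs, table = lifted_transition()
```

Each row of the lifted table sums to 1, and transitions between incompatible pairs have probability 0, which is the only structural cost of passing to the vector process.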

(b) Wide sense  Let $\{x_t, t \in T\}$ be a Gaussian process with zero means,
$E\{x_t\} = 0$, and which either is real or, if not, satisfies the equation

$E\{x_s x_t\} = 0$

(cf. Theorem 3.2). To define a Markov process in the wide sense we must
discover what apparently weaker property of the $x_t$ process, defined in
terms of variances and covariances, is equivalent to the Markov property,
or at least to an important special case of this property. We have already
remarked (§3) that if $t_1 < \cdots < t_n$ one version of $E\{x_{t_n} \mid x_{t_1}, \ldots, x_{t_{n-1}}\}$
is that linear combination of $x_{t_1}, \ldots, x_{t_{n-1}}$, $\hat{E}\{x_{t_n} \mid x_{t_1}, \ldots, x_{t_{n-1}}\}$ in the
notation of §3, which is the closest to $x_{t_n}$ in the sense of minimizing

(6.20)    $E\{|x_{t_n} - \sum_{j=1}^{n-1} a_j x_{t_j}|^2\} = \sum_{j,k=1}^{n} E\{x_{t_j} \bar{x}_{t_k}\}\, a_j \bar{a}_k \qquad (a_n = -1).$

Now the Markov property implies that

(6.21)    $E\{x_{t_n} \mid x_{t_1}, \ldots, x_{t_{n-1}}\} = E\{x_{t_n} \mid x_{t_{n-1}}\}$

with probability 1, and in fact considerably more, since the Markov
property applies to the conditional distribution of $x_{t_n}$, not merely to its
conditional expectation. However, the condition (6.1) is equivalent to
(6.21) in the present case, as we shall see in a moment. Hence the
condition that the $x_t$ process be a Markov process can be written in the form

(6.21')    $\hat{E}\{x_{t_n} \mid x_{t_1}, \ldots, x_{t_{n-1}}\} = \hat{E}\{x_{t_n} \mid x_{t_{n-1}}\}$

with probability 1. This is a condition involving only variances and
covariances, according to (6.20). Hence we shall say that any process,
Gaussian or not, is a Markov process in the wide sense if $E\{|x_t|^2\} < \infty$ for
all $t$, and if, whenever $t_1 < \cdots < t_n$, (6.21') is satisfied with probability 1.
We still must justify, for the original Gaussian $x_t$ process, our statement
that (6.21) implies (6.1). For this process, the difference

$y = x_{t_n} - E\{x_{t_n} \mid x_{t_1}, \ldots, x_{t_{n-1}}\}$

is a Gaussian variable, with mean 0, and is orthogonal to and therefore
independent of $x_{t_1}, \ldots, x_{t_{n-1}}$. Hence the conditional distribution of $x_{t_n}$,
given that

$x_{t_1}(\omega) = \xi_1, \ldots, x_{t_{n-1}}(\omega) = \xi_{n-1},$

is that of

(6.22)    $y + E\{x_{t_n} \mid x_{t_1}(\omega) = \xi_1, \ldots, x_{t_{n-1}}(\omega) = \xi_{n-1}\},$

which is Gaussian, with mean a certain linear combination of $\xi_1, \ldots, \xi_{n-1}$
[the last term in (6.22)], and variance that of $y$. The conditional
distribution of $x_{t_n}$ is thus entirely determined by the conditional expectation,
and (6.21) implies (6.1), in the present case, as was to be proved.

In V §8 a simple condition will be derived which is necessary and 
sufficient that a process be a Markov process in the wide sense. Note 
that a process which is a Markov process in the strict sense is not neces- 
sarily one in the wide sense, even if the expectations of the squares of the 
variables involved exist. 


7. Martingales 


A stochastic process $\{x_t, t \in T\}$ is called a martingale if $E\{|x_t|\} < \infty$
for all $t$ and if, whenever $n \ge 1$ and $t_1 < \cdots < t_{n+1}$,

(7.1)    $E\{x_{t_{n+1}} \mid x_{t_1}, \ldots, x_{t_n}\} = x_{t_n}$

with probability 1. This is a strict sense definition, and martingale will
always mean martingale in this strict sense. A stochastic process with
variables $\{x_t\}$ is called a martingale in the wide sense if $E\{|x_t|^2\} < \infty$ for
all $t$ and if, whenever $n \ge 1$ and $t_1 < \cdots < t_{n+1}$,

(7.1')    $\hat{E}\{x_{t_{n+1}} \mid x_{t_1}, \ldots, x_{t_n}\} = x_{t_n}$

with probability 1. Applying the rules of combination of $E$ and $\hat{E}$ (cf.
IV §3 for the derivation of these rules for $\hat{E}$), it is seen that a sequence of
random variables $x_1, x_2, \ldots$ is a martingale if and only if

(7.2)    $E\{|x_n|\} < \infty, \quad n \ge 1, \qquad E\{x_{n+1} \mid x_1, \ldots, x_n\} = x_n$

with probability 1, a martingale in the wide sense if and only if

(7.2')    $E\{|x_n|^2\} < \infty, \quad n \ge 1, \qquad \hat{E}\{x_{n+1} \mid x_1, \ldots, x_n\} = x_n$

with probability 1.
If $y_1, y_2, \ldots$ are defined as

$y_1 = x_1, \qquad y_n = x_n - x_{n-1}, \quad n > 1,$

then, if the $x_n$ process is a martingale,

(7.3)    $E\{|y_n|\} < \infty, \qquad E\{y_{n+1} \mid y_1, \ldots, y_n\} = 0, \quad n \ge 1,$

with probability 1. The $x_n$'s are thus partial sums of the series $\sum_n y_n$,
where the $y_n$'s satisfy (7.3). Conversely, the partial sums of any such
series constitute a martingale. If the $x_n$ process is a martingale in the
wide sense, the corresponding property of the $y_n$'s is that these random
variables are mutually orthogonal.




A $y_n$ process satisfying (7.3) has independent interest and merits further
examination. The condition (7.3) lies between zero correlation and
independence of the $y_n$'s. In fact, the condition of zero correlation of
the $y_n$'s is

$E\{y_m \bar{y}_n\} = E\{y_m\} E\{\bar{y}_n\}, \qquad m \ne n.$

The condition (7.3) can be reformulated as follows: if $\Phi(y_1, \ldots, y_n)$ is
any bounded Baire function of the indicated variables,

(7.4)    $E\{y_{n+1} \Phi(y_1, \ldots, y_n)\} = 0, \qquad n \ge 1.$

This is, of course, stronger than zero correlation. [Note that $E\{y_n\} = 0$
if (7.3) is true.] On the other hand, the condition of independence of the
$y_n$'s is equivalent to the still stronger condition that for every $\Phi$ as above
and every bounded Baire function $\Psi(y_{n+1})$ of $y_{n+1}$,

(7.5)    $E\{\Psi(y_{n+1}) \Phi(y_1, \ldots, y_n)\} = E\{\Psi(y_{n+1})\}\, E\{\Phi(y_1, \ldots, y_n)\}.$


A Markov process involves a stronger restriction, in one sense, than a 
martingale, since the Markov property involves distributions rather than 
expectations; on the other hand, a Markov process need not be a 
martingale. 
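To make concrete the remark that the martingale-difference condition lies strictly between zero correlation and independence, here is a small exact computation. The construction (two fair signs, with the second variable rescaled according to the first) is an illustrative assumption, not an example from the text.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes (e1, e2) of two fair signs (assumed setup).
OUTCOMES = [((e1, e2), Fraction(1, 4))
            for e1, e2 in product((-1, 1), repeat=2)]


def y(outcome):
    """y_1 = e1; y_2 = e2, doubled when e1 = +1, so y_2 depends on y_1."""
    e1, e2 = outcome
    return e1, e2 * (2 if e1 == 1 else 1)


def expect(f):
    """Expectation of f(y_1, y_2)."""
    return sum(p * f(*y(w)) for w, p in OUTCOMES)


def prob(event):
    """Probability that event(y_1, y_2) holds."""
    return sum(p for w, p in OUTCOMES if event(*y(w)))


# E{y_2 Phi(y_1)} = 0 for the indicator functions Phi of each value of y_1,
# which is the n = 1 case of the reformulated condition.
cond_zero = all(expect(lambda a, b, v=v: b * (a == v)) == 0 for v in (-1, 1))

# Independence fails: {y_1 = -1, y_2 = 2} is impossible, yet both marginal
# events have positive probability.
joint = prob(lambda a, b: a == -1 and b == 2)
marginals = prob(lambda a, b: a == -1) * prob(lambda a, b: b == 2)
```

Here the conditional expectation of $y_2$ given $y_1$ vanishes, while the joint probability differs from the product of the marginals.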

Example 1  Let $\eta, \xi_1, \xi_2, \ldots$ be any random variables with

$E\{|\eta|\} < \infty.$

Then, if $x_n$ is defined by

(7.6)    $x_n = E\{\eta \mid \xi_1, \ldots, \xi_n\},$

the $x_n$ process is a martingale. In fact,

(7.7)    $E\{x_{n+1} \mid \xi_1, \ldots, \xi_n\} = E\{E\{\eta \mid \xi_1, \ldots, \xi_{n+1}\} \mid \xi_1, \ldots, \xi_n\} = E\{\eta \mid \xi_1, \ldots, \xi_n\} = x_n$

with probability 1. Hence, since $x_1, \ldots, x_n$ are random variables on the
sample space of $\xi_1, \ldots, \xi_n$,

(7.8)    $E\{x_{n+1} \mid x_1, \ldots, x_n, \xi_1, \ldots, \xi_n\} = E\{x_{n+1} \mid \xi_1, \ldots, \xi_n\} = x_n$

with probability 1. Taking the conditional expectation of both sides of
(7.8) with $x_1, \ldots, x_n$ fixed gives (7.2).

The corresponding continuous parameter example of a martingale is
given by

(7.9)    $x_t = E\{\eta \mid \xi_s, s \le t\},$

where the $\xi_s$ and $\eta$ are arbitrary random variables except that
$E\{|\eta|\} < \infty$.

If $E$ is replaced by $\hat{E}$ in this example, the processes derived are
martingales in the wide sense. The proofs are still valid.
Example 2  If the random variables $x_1, x_2, \ldots$ can be put in the form

(7.10)    $x_n = y_1 + \cdots + y_n, \quad n \ge 1, \qquad E\{|y_j|\} < \infty, \quad j \ge 1,$

where the $y_j$'s are mutually independent, and $E\{y_m\} = 0$, $m > 1$, then the
$x_n$ process is a martingale (also a martingale in the wide sense if
$E\{|y_m|^2\} < \infty$). This is a special case of the general form of a discrete
parameter martingale already discussed. The continuous parameter
version of this example will be defined in §9.

Example 3  Let $y_1, y_2, \ldots$ be any random variables, and suppose that
the distribution of $y_1, \ldots, y_n$ is given by a Baire density function $p_n$ in
$n$-dimensional space. In this way a sequence of density functions
$p_1, p_2, \ldots$ is defined. Let $q_1, q_2, \ldots$ be a second such sequence; in the
following we shall consider the $y_j$ process distributions as determined by
the $p_n$'s. Define the random variable $x_n$ by

(7.11)    $x_n = \frac{q_n(y_1, \ldots, y_n)}{p_n(y_1, \ldots, y_n)}.$

Note that the denominator vanishes with probability 0. We shall prove
that the $x_n$ process is a martingale if we suppose that $q_n = 0$ whenever
$p_n = 0$. To prove this assertion we shall assume, as we can for the
purposes of the proof according to the representation theory of I §6, that
$y_1, y_2, \ldots$ are the coordinate variables in infinite dimensional space.
Then there is a conditional probability density of $y_{n+1}$ for given $y_1, \ldots, y_n$,

$\frac{p_{n+1}(y_1, \ldots, y_n, \eta)}{p_n(y_1, \ldots, y_n)},$

so that

(7.12)    $E\{x_{n+1} \mid y_1, \ldots, y_n\} = \int_{-\infty}^{\infty} \frac{q_{n+1}(y_1, \ldots, y_n, \eta)}{p_{n+1}(y_1, \ldots, y_n, \eta)} \cdot \frac{p_{n+1}(y_1, \ldots, y_n, \eta)}{p_n(y_1, \ldots, y_n)}\, d\eta = \frac{\int_{-\infty}^{\infty} q_{n+1}(y_1, \ldots, y_n, \eta)\, d\eta}{p_n(y_1, \ldots, y_n)} = \frac{q_n(y_1, \ldots, y_n)}{p_n(y_1, \ldots, y_n)} = x_n.$

Taking the conditional expectation of both sides of (7.12) relative to
$x_1, \ldots, x_n$ gives (7.2).
The martingale defined in (7.11) has important statistical applications.
In statistics the ratio defining $x_n$ is called a "likelihood ratio." Note that,
if $p_n(\lambda_1, \ldots, \lambda_n)$ and $q_n(\lambda_1, \ldots, \lambda_n)$ are not interpreted as densities but
as the probability that a set of discrete-valued random variables take on
the values $\lambda_1, \ldots, \lambda_n$, the $x_n$ process defined by (7.11) is still a martingale.
[The integrals in (7.12) become sums.] A more general type of martingale
is defined as follows. Let $y_1, y_2, \ldots$ and $z_1, z_2, \ldots$ be sequences of
random variables. (The $z_n$ sequence is not necessarily defined on the
same $\omega$ space as the $y_n$ sequence.) Let $P_n$, $Q_n$ be the measures of
$n$-dimensional Borel sets $A$ defined by

$P_n(A) = P\{[y_1(\omega), \ldots, y_n(\omega)] \in A\},$
$Q_n(A) = P\{[z_1(\omega), \ldots, z_n(\omega)] \in A\}.$

Suppose that $Q_n(A) = 0$ if $P_n(A) = 0$, that is, that $Q_n$ measure is
absolutely continuous with respect to $P_n$ measure. There is then a relative
density, according to the Radon-Nikodym theorem, that is, there is a
Baire function $\Phi_n$ of $n$ variables, such that

$Q_n(A) = \int_A \Phi_n\, dP_n.$

Define $x_n$ by

$x_n = \Phi_n(y_1, \ldots, y_n).$

Then the $x_n$ process is a martingale. The proof can be carried through
as in the preceding special case or by direct appeal to the definition of the
conditional expectations involved. Likelihood ratios will be examined
in more detail from the martingale point of view in VII.
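The discrete-valued case of the likelihood ratio martingale can be verified exactly on a small example. The sketch below assumes independent 0/1 variables, with `P_ONE` and `Q_ONE` as made-up success probabilities under the governing and alternative distributions; all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

P_ONE = Fraction(1, 2)   # P{y_j = 1} under the governing p-distribution
Q_ONE = Fraction(1, 5)   # P{y_j = 1} under the alternative q-distribution


def mass(values, one):
    """Probability of an independent 0/1 sequence when P{y = 1} = one."""
    result = Fraction(1)
    for v in values:
        result *= one if v == 1 else 1 - one
    return result


def likelihood_ratio(values):
    """x_n = q_n(y_1, ..., y_n) / p_n(y_1, ..., y_n)."""
    return mass(values, Q_ONE) / mass(values, P_ONE)


def conditional_mean_next(values):
    """E{x_{n+1} | y_1, ..., y_n = values} under the p-distribution."""
    return sum(mass((v,), P_ONE) * likelihood_ratio(values + (v,))
               for v in (0, 1))


# The conditional mean of x_{n+1} equals x_n on every path of length 1..4.
martingale_ok = all(conditional_mean_next(path) == likelihood_ratio(path)
                    for n in range(1, 5)
                    for path in product((0, 1), repeat=n))
```

The sum over the next value plays the role of the integral in the density case: the $p$ factors cancel and the remaining $q$ masses sum to 1.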


8. Stationary stochastic processes 


(a) Strictly stationary processes  A strictly stationary stochastic process
$\{x_t, t \in T\}$ is one whose distributions remain the same as time passes; that
is, the multivariate distribution of the random variables $x_{t_1+h}, \ldots, x_{t_n+h}$
is independent of $h$. Here $t_1, \ldots, t_n$ is any finite set of parameter values,
and $h$ is chosen so that the translated parameter values are also parameter
values.

Example 1  Let $\ldots, x_{-1}, x_0, x_1, \ldots$ be mutually independent random
variables with a common distribution function. Then the $x_n$ process is
strictly stationary, as is any $\hat{x}_n$ process, where


Eq = È GnimEms 
a) 
and the coefficients $a_j$ are chosen to make the series converge in probability. (We
shall see in III that convergence in probability of a series of mutually
independent random variables implies convergence with probability 1.)




A strictly stationary process is subject to the strong law of large
numbers: if the parameter is integral-valued and if $E\{|x_0|\} < \infty$, then
(X, Theorem 2.1)

$\lim_{n \to \infty} \frac{x_0 + \cdots + x_n}{n + 1} = \hat{x}$

exists with probability 1, that is, for almost all sample sequences. The
limit $\hat{x}$ is identically constant (with probability 1) in many important
special cases. For example, if the $x_n$'s are mutually independent it will
be seen that

$\hat{x} = E\{x_0\}$

with probability 1.

In the continuous parameter case the strong law of large numbers for
strictly stationary processes becomes: if $E\{|x_0|\} < \infty$ and if the process
is measurable, then (XI, Theorem 2.1)

$\lim_{t \to \infty} \frac{1}{t} \int_0^t x_s\, ds$

exists with probability 1, that is, for almost all sample functions.

(b) Wide sense stationary processes  The process $\{x_t, t \in T\}$ is called
stationary in the wide sense if $E\{|x_t|^2\} < \infty$ for $t \in T$, and if $E\{x_{s+t} \bar{x}_s\} = R(t)$
does not depend on $s$. The function $R$ is called the covariance function
of the process. Usually the added condition that $E\{x_s\}$ does not depend
on $s$ is imposed. This condition is unnatural mathematically, and has
nothing to do with the essential properties of interest in these processes,
and we shall therefore not impose it. When the added condition is
satisfied, however,

$E\{[x_{s+t} - E\{x_{s+t}\}][\bar{x}_s - E\{\bar{x}_s\}]\} = E\{x_{s+t} \bar{x}_s\} - E\{x_{s+t}\} E\{\bar{x}_s\}$

is also independent of $s$ and the process with variables $\{x_t - E\{x_t\}\}$ is used
rather than the original process. If this is done the random variables
determining the process have zero expectations and $R(t)$ is a true
covariance.

If a real Gaussian process is stationary in the wide sense, and if $E\{x_s\}$
does not depend on $s$, the process is strictly stationary because the
determining parameters of Gaussian distributions are the means and
covariances. If a complex Gaussian process is stationary, if $E\{x_s\} = 0$,
and if (see Theorem 3.2) $E\{x_s x_t\} = 0$ for all $s$, $t$, the process is strictly
stationary for the same reason. Thus the definition given is a proper
wide sense definition to match the strict sense definition of a stationary
process.




A strictly stationary process is stationary in the wide sense if
$E\{|x_t|^2\} < \infty$ for all $t$.

In this book, both strictly stationary and wide sense stationary processes 
will be referred to as stationary processes. In the literature, however, 
“stationary processes” sometimes means “strictly stationary processes.” 
“Temporally homogeneous” has been used as a synonym for “stationary” 
but is now uncommon. 

Example 2  Let $\ldots, x_{-1}, x_0, x_1, \ldots$ be mutually independent random
variables with

$E\{x_n\} = 0, \qquad E\{|x_n|^2\} = \sigma^2 > 0.$

Then the $x_n$ process is stationary in the wide sense with

$R(n) = 0, \quad n \ne 0,$
$\phantom{R(n)} = \sigma^2, \quad n = 0.$

The process is strictly stationary if and only if the $x_n$'s have a common
distribution function.
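A minimal sketch of the distinction drawn in this example: the two distributions below are made-up choices with the same mean 0 and variance 1, assigned to even and odd indices, so that the covariance function depends only on the lag while the process is not strictly stationary. The helper names are hypothetical.

```python
from fractions import Fraction

# Two hypothetical laws (value -> probability), both with mean 0, variance 1.
EVEN = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
ODD = {-2: Fraction(1, 8), 0: Fraction(3, 4), 2: Fraction(1, 8)}


def law(n):
    """Distribution assigned to x_n: alternate between the two laws."""
    return EVEN if n % 2 == 0 else ODD


def moment(dist, k):
    """k-th moment of a finite distribution."""
    return sum(p * v ** k for v, p in dist.items())


def covariance(s, t):
    """E{x_s x_t} for mutually independent x_n with the laws above."""
    if s == t:
        return moment(law(s), 2)
    return moment(law(s), 1) * moment(law(t), 1)
```

Since `covariance(n + m, n)` is 1 for `m == 0` and 0 otherwise for every `n`, the process is stationary in the wide sense, while `law(0) != law(1)` shows the distributions are not common.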

A process stationary in the wide sense is subject to the law of large 
numbers, and in fact the theorems stated in this connection for strictly 
stationary processes remain true if “limit with probability 1” is replaced 
by “li.m.”; see X, Theorem 6.1, and XI, Theorem 6.1. 


9. Processes with independent increments 


A process with independent increments is one whose random variables
$\{x_t\}$ have the property that, if $t_1 < \cdots < t_n$ $(n \ge 3)$, the differences

$x_{t_2} - x_{t_1}, \ldots, x_{t_n} - x_{t_{n-1}}$

are mutually independent. If $x_0, x_1, \ldots$ constitute such a process this
means that $x_1 - x_0, x_2 - x_1, \ldots$ are mutually independent; $x_n - x_0$ is
the $n$th partial sum of the series $\sum_{m=1}^{\infty} (x_m - x_{m-1})$ of mutually independent
random variables. Conversely, if $y_1, y_2, \ldots$ are mutually independent
random variables, if $x_0$ is an arbitrary random variable, and if $x_n$ for
$n \ge 1$ is defined by $x_n = x_0 + y_1 + \cdots + y_n$, the $x_n$ process is a process
with independent increments. In practice the term "process with
independent increments" is used only in the continuous parameter case.

A continuous parameter process $\{x_t, 0 \le t < \infty\}$ with independent
increments and $P\{x_0(\omega) = 0\} = 1$ is for $t > 0$ the continuous parameter
version of a discrete parameter process whose random variables are the
partial sums of a series of mutually independent random variables. Just
as the latter process is one example of a discrete parameter Markov
process and also (if the expectations of the summands all vanish) an
example of a discrete parameter martingale, so the former process is a
continuous parameter Markov process and, if $E\{x_t - x_s\} = 0$, is also a
martingale.

If the distribution of $x_t - x_s$ depends only on $t - s$, a process with
independent increments will be said to have stationary (strict sense)
increments.

Examples of processes with independent increments are specified by the
distribution of $x_t - x_s$. If $s_1 < s_2 < s_3$, the random variables

$y_1 = x_{s_2} - x_{s_1}, \qquad y_2 = x_{s_3} - x_{s_2}, \qquad y_3 = x_{s_3} - x_{s_1}$

will thereby be assigned distributions. Since $y_3 = y_1 + y_2$, and since
$y_1$ and $y_2$ are mutually independent, the distribution assigned to $y_3$ must
be that of the sum of two independent random variables, one with the
distribution of $y_1$, the other with that of $y_2$. This consistency condition
is easily checked in the examples to be given, and serves to insure the truth
of the Kolmogorov consistency conditions required to insure the
possibility of setting up the stochastic process. Note that it is not necessary
to assign distributions to the $x_t$'s themselves, since only the differences
are usually involved, and in fact the procedure used to set up a function
space measure is applicable without the $x_t$ distributions. This will mean,
however, that $x_t$ will not be a random variable, and that the random
variables of the process are really the differences $x_t - x_s$. This situation
is usually avoided by choosing some parameter point $t_0$ and considering
the process $\{x_t - x_{t_0}, t \in T\}$, that is, the variables of the process are
normalized to vanish at $t_0$.

Example 1  Brownian motion process  In this case it is supposed that
$x_t - x_s$ is real and normally distributed, with

$E\{x_t - x_s\} = 0,$
(9.1)
$E\{(x_t - x_s)^2\} = \sigma^2 |t - s|,$

where $\sigma > 0$ is a fixed parameter. The parameter set $T$ is usually taken
as either the whole $t$-axis or the half-axis $[0, \infty)$, and in the latter case
$x_0$ is usually defined to be 0 with probability 1, that is, we consider the
differences $x_t - x_0$ as explained above. This process was first discussed
by Bachelier, and later more rigorously by Wiener. It is sometimes
called the Wiener process. For fixed $t_0$ the differences $\{x_t - x_{t_0}, t \ge t_0\}$
constitute a Markov process and also a martingale.
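A sample path of such a process at finitely many time points can be sketched by summing independent Gaussian increments with the variances prescribed by (9.1). The function name `brownian_path`, the chosen times, and the seed below are illustrative assumptions.

```python
import math
import random


def brownian_path(times, sigma=1.0, rng=None):
    """Sample x_t at increasing `times`, normalized so x = 0 at times[0]."""
    rng = rng or random.Random()
    path = [0.0]
    for s, t in zip(times, times[1:]):
        # Independent increment: mean 0, variance sigma^2 * (t - s).
        path.append(path[-1] + rng.gauss(0.0, sigma * math.sqrt(t - s)))
    return path


times = [0.0, 0.5, 1.0, 2.0, 4.0]
path = brownian_path(times, sigma=2.0, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; only the finite-dimensional distributions are simulated, not the continuity properties of sample functions.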
If microscopic particles are observed in a fluid, they are seen to move
in an irregular fashion under the impacts of the fluid molecules. The
motion is called Brownian motion after the English botanist Brown who
reported the phenomenon in 1826. Einstein and Smoluchovski showed
that to a first approximation $x(t)$, the $x$ coordinate of a Brownian particle
at time $t$, defines a function of $t$ for each particle motion which can be
identified with a sample function of a Brownian motion stochastic process.
The constant $\sigma^2$ depends on the mass of the particle and on the viscosity
of the fluid.

It will be shown in VIII §2 that the sample functions of a separable
Brownian motion process are almost all continuous functions, and in fact 
(VII §7) this is essentially the only process with independent increments 
having this property. 

Example 2  Poisson process  Here it is supposed that for every pair
$s < t$, $x_t(\omega) - x_s(\omega)$ is integral-valued, with

(9.2)    $P\{x_t(\omega) - x_s(\omega) = \nu\} = \frac{e^{-c(t-s)} [c(t-s)]^{\nu}}{\nu!}, \qquad \nu = 0, 1, 2, \ldots,$

where $c > 0$ is a fixed parameter. A Poisson process has stationary
increments (strict sense). For fixed $t_0$ the differences $\{x_t - x_{t_0}, t \ge t_0\}$
constitute a Markov process, and the differences $\{x_t - x_{t_0} - ct, t \ge t_0\}$
constitute a process which is both a Markov process and a martingale.
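The consistency condition for processes with independent increments, described earlier in this section, can be checked numerically for the Poisson distribution: the distribution of the increment over $(s_1, s_3)$ should be the convolution of those over $(s_1, s_2)$ and $(s_2, s_3)$. The rate and the time points below are made-up values, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """P{increment = k} for a Poisson-distributed increment with given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def convolved_pmf(k, mean1, mean2):
    """P{y_1 + y_2 = k} for independent Poisson increments y_1, y_2."""
    return sum(poisson_pmf(j, mean1) * poisson_pmf(k - j, mean2)
               for j in range(k + 1))


c = 1.5                       # hypothetical rate
s1, s2, s3 = 0.0, 0.7, 2.0    # hypothetical times s1 < s2 < s3
```

For each `k`, `convolved_pmf(k, c * (s2 - s1), c * (s3 - s2))` agrees with `poisson_pmf(k, c * (s3 - s1))` up to floating-point rounding.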

It will be shown in VIII §4 that the sample functions of a separable
Poisson process are (almost all) monotone non-decreasing, increasing in
isolated jumps of unit magnitude. The points where these jumps occur
can be considered the times when some sort of random event occurs, and
$x_t - x_{t_0}$ is then the number of events that have occurred by time $t$,
beginning the count at time $t_0$. The constant $c$ is the expected rate of
occurrence. With this interpretation it is shown that the conditional
distribution of events in an interval $(s, t)$, if it is known that $n$ have occurred in
the interval, is that of $n$ points chosen independently in $(s, t)$, each choice
uniformly distributed over $(s, t)$. The Poisson process has been found a
good approximation to the process governing the times of emissions of
radioactive material. Each sample function can also be considered as
defining an infinite sequence of values of $t$ (positive and negative if the
condition $t \ge 0$ is dropped), the points where the jumps occur, and it is
such sequences that are usually in the minds of those who speak of a
uniform distribution of points over an infinite interval or of a "purely
random" sequence of points on such an interval.

Almost all sample functions of a separable Poisson process are
continuous except at the points of an at most enumerable parameter set, and even
at the points of discontinuity left- and right-hand finite limits exist. This
assertion is true of the general process with independent increments after
a suitable centering has been accomplished by replacing $x_t$ by $x_t - f(t)$,
where $f$ is a function of $t$ but not of $\omega$.




10. Processes with uncorrelated or orthogonal increments 
A process $\{x_t, t \in T\}$ is said to have uncorrelated increments if

(10.1)    $E\{|x_t - x_s|^2\} < \infty, \qquad s, t \in T,$

and if, whenever parameter values satisfy the inequality $s_1 < t_1 \le s_2 < t_2$,
the increments $x_{t_1} - x_{s_1}$ and $x_{t_2} - x_{s_2}$ are uncorrelated with each other,

(10.2)    $E\{(x_{t_1} - x_{s_1})(\bar{x}_{t_2} - \bar{x}_{s_2})\} = E\{x_{t_1} - x_{s_1}\}\, E\{\bar{x}_{t_2} - \bar{x}_{s_2}\}.$

A closely related type of process is a process with orthogonal increments,
that is, one for which (10.1) is true, and for which (10.2) is replaced by

(10.3)    $E\{(x_{t_1} - x_{s_1})(\bar{x}_{t_2} - \bar{x}_{s_2})\} = 0.$

If an $x_t$ process has uncorrelated increments, the $y_t$ process determined by

$y_t = x_t - E\{x_t\}$

is a process with uncorrelated and orthogonal increments.
If $\{x_n, n \ge 0\}$ is a process with uncorrelated [orthogonal] increments,
$x_n - x_0$ is the $n$th partial sum of the series $\sum_{m=1}^{\infty} (x_m - x_{m-1})$ of mutually
uncorrelated [orthogonal] random variables. Conversely, if $y_1, y_2, \ldots$
are mutually uncorrelated [orthogonal] random variables, if $x_0$ is an
arbitrary random variable, and if $x_n$ for $n \ge 1$ is defined by

$x_n = x_0 + y_1 + \cdots + y_n,$

then the $x_n$ process is a process with uncorrelated [orthogonal] increments.
In practice the term "process with uncorrelated [orthogonal] increments"
is used only in the continuous parameter case.

If the increments of a process satisfying (10.1) are stationary as far as
second moments are concerned, in the sense that

$E\{|x_t - x_s|^2\}$

depends only on $t - s$, the process is said to have stationary (wide sense)
increments. An elementary calculation shows that in this case the
expectation

$E\{(x_t - x_s)(\bar{x}_u - \bar{x}_v)\}$

depends only on the differences $t - s$, $u - t$, $v - t$.
If a process with uncorrelated or orthogonal increments is Gaussian,
with

$E\{x_t\} = 0,$

and if it is real, or if (see Theorem 3.2)

$E\{x_s x_t\} = 0,$

then the process has independent increments. Thus the processes with
uncorrelated or orthogonal increments are processes with independent
increments in the wide sense.

If a process has independent increments, and if (10.1) is true, the process 
also has uncorrelated increments. Thus the Brownian movement and 
Poisson processes defined in §9 are processes with uncorrelated incre- 
ments, and in fact with stationary (strict as well as wide sense) increments. 

Let $\{x_t, t \in T\}$ be a process with orthogonal increments. Then $F(t)$ can
be defined to satisfy

(10.4)    $E\{|x_t - x_s|^2\} = F(t) - F(s), \qquad s < t.$

For example, we can take any $t_0 \in T$ and define

$F(t) = E\{|x_t - x_{t_0}|^2\}, \qquad t \ge t_0,$
$\phantom{F(t)} = -E\{|x_t - x_{t_0}|^2\}, \qquad t < t_0.$

The function $F$ is monotone non-decreasing, and is determined by (10.4)
up to an additive constant. The process has stationary (wide sense)
increments if and only if the difference $F(t) - F(s)$ depends only on $t - s$.
In the stationary case denote this difference by $F_1(t - s)$, $s < t$. Then

$F_1(t_3 - t_1) = F_1(t_3 - t_2) + F_1(t_2 - t_1), \qquad t_1 < t_2 < t_3.$

Let $T_1$ be the set of differences $t - s$ with $s, t \in T$ and $s < t$. Then $F_1$ is
a monotone non-decreasing function, defined on $T_1$, and satisfying the
functional equation

$F_1(u + v) = F_1(u) + F_1(v), \qquad u, v \in T_1.$

The only monotone solution of this functional equation, if $T_1$ is an interval,
is

$F_1(s) = \text{const.}\ s.$

Thus, if $T_1$ is an interval, the $x_t$ process has stationary (wide sense)
increments if and only if

$F(t) = \text{const.} + \sigma^2 t$

for some $\sigma \ge 0$.
If the $x_t$ process has uncorrelated increments, $m(t)$ can be defined to
satisfy

$E\{x_t - x_s\} = m(t) - m(s)$

and the $x_t - m(t)$ process will then have orthogonal increments.
The Brownian movement process defined in §9 is a process with
stationary uncorrelated (and orthogonal) increments. The mean and
variance functions $m$ and $F$ are, if the arbitrary additive constants are
chosen properly,

$E\{x_t - x_0\} = m(t) = 0,$
$E\{(x_t - x_0)^2\} = F(t) = \sigma^2 t.$

The Poisson process defined in §9 is a process with stationary uncorrelated
(but not orthogonal) increments, with

$E\{x_t - x_0\} = m(t) = ct,$
$E\{|x_t - x_0 - ct|^2\} = ct.$


Processes with uncorrelated and orthogonal increments are essential 
tools in the study of stationary processes, and will, therefore, be studied 
in IX before stationary processes are taken up in X and XI. 

It will sometimes be convenient to write (10.4) symbolically in the form 


$E\{|dx_t|^2\} = dF(t).$


CHAPTER III


Processes with Mutually Independent Random Variables


1. General remarks 

In this chapter the random variables will be real-valued. The extension 
of the results to complex-valued random variables will however be obvious. 

As already remarked in II §4, processes with mutually independent 
random variables are only useful in the discrete parameter case. In other 
words the useful case is typically a sequence $x_1, x_2, \ldots$ of mutually
independent random variables. The process is characterized by the
distribution functions of the individual variables, because of the inde- 
pendence property. 

One of the most striking properties of these processes is the zero-one 
law, which will be applied frequently in the following sections. 

THEOREM 1.1 (Zero-one law)  Let $x_1, x_2, \ldots$ be mutually independent
random variables. Then, if $\Lambda$ is an $\omega$ set which is measurable on the sample
space of $x_n, x_{n+1}, \ldots$ for every $n$, it follows that $P\{\Lambda\} = 0$ or $P\{\Lambda\} = 1$,
and, if $y$ is a random variable measurable on the sample space of $x_n,
x_{n+1}, \ldots$ for every $n$, it follows that there is a constant $c$ such that
$P\{y(\omega) = c\} = 1$.

This theorem is usually stated somewhat loosely as follows: let
$x_1, x_2, \ldots$ be a sequence of mutually independent random variables.
Then, if $\Lambda$ is an event dependent on $x_n, x_{n+1}, \ldots$ for every $n$, $\Lambda$ has
probability either 0 or 1, and, if $y$ is equal with probability 1 to a function
of these same variables for every $n$, then $y(\omega) = $ const., with probability 1.

The second assertion of the theorem follows readily from the first,
using the fact that, if $y$ is a random variable satisfying the hypotheses of
the second assertion, the $\omega$ set $\{y(\omega) \in A\}$ satisfies those of the first, for
every Borel set $A$. We therefore discuss only the first assertion. Suppose
then that $\Lambda$ is an $\omega$ set with the stated properties, and let $\mathcal{G}$ be the class
of measurable $\omega$ sets $M$ with the property that

(1.1)    $P\{\Lambda M\} = P\{\Lambda\} P\{M\}.$

Then, according to our hypotheses on $\Lambda$, for each $n$ the class $\mathcal{G}$ includes
every $\omega$ set measurable on the sample space of $x_1, \ldots, x_n$. Let $\mathcal{F}_0$ be
the class of sets measurable on the sample space of $x_1, \ldots, x_n$ for
some $n$. Then $\mathcal{F}_0$ is a field, and we have just shown that $\mathcal{F}_0 \subset \mathcal{G}$.
The class $\mathcal{G}$ obviously includes the limits of every monotone sequence of
$\mathcal{G}$ sets, and therefore (Supplement, Theorem 1.2) includes the Borel field
generated by $\mathcal{F}_0$. This Borel field is the class of $\omega$ sets measurable on
the sample space of $x_1, x_2, \ldots$. Hence $\Lambda \in \mathcal{G}$, so that, taking $M = \Lambda$ in (1.1),

(1.2)    $P\{\Lambda\} = P\{\Lambda\}^2,$

and it follows that $P\{\Lambda\} = 0$ or $P\{\Lambda\} = 1$, as was to be proved.

The hypothesis of mutual independence of the $x_n$'s served only to
insure the truth of (1.1) for $M$ in the field $\mathcal{F}_0$ of sets measurable on the
sample space of $x_1, \ldots, x_n$ for some $n$. If we drop this independence
hypothesis, and in fact only suppose that $\Lambda$ is a measurable $\omega$ set with
the property that for each $n$

$P\{\Lambda \mid x_1, \ldots, x_n\} = P\{\Lambda\}$

with probability 1, it is still true that (1.1) follows, for $M \in \mathcal{F}_0$, and the
proof of (1.2) then goes through as in the above special case. However,
even in this more general form the theorem is a trivial corollary of a
martingale convergence theorem (VII, Theorem 4.3).

Before applying the zero-one law we give an example to show the
possibilities if the $x_n$'s are not mutually independent. Suppose that

$x_1(\omega) = x_2(\omega) = \cdots$

with probability 1. Then $y = x_1$ satisfies the condition of Theorem 1.1,
but this random variable need not be a constant, since it is completely
unrestricted.
As a first application of the zero-one law, consider the following
problem. Let $M_1, M_2, \ldots$ be measurable $\omega$ sets, and let $\Lambda$ be the $\omega$
set of those points in infinitely many $M_j$'s,

$\Lambda = \bigcap_{n=1}^{\infty} \bigcup_{j=n}^{\infty} M_j.$

The problem is to evaluate $P\{\Lambda\}$. In the less prosaic language of events,
the problem is to evaluate the probability that infinitely many events
(represented by the $M_j$'s) occur. Define $x_j$ by

$x_j(\omega) = 1, \qquad \omega \in M_j,$
$\phantom{x_j(\omega)} = 0 \qquad \text{otherwise.}$

Then obviously $\Lambda$ has the property presupposed in Theorem 1.1. If the
$M_j$'s do not represent independent events, that is to say, if the $x_j$'s are
not mutually independent, then $P\{\Lambda\}$ may be any number between 0
and 1. However, according to the zero-one law, if these events are
mutually independent then $P\{\Lambda\}$ must be 0 or 1. The following theorem,
usually called the Borel-Cantelli lemma, gives a criterion for each case.

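THEOREM 1.2  Let $M_1, M_2, \ldots$ be measurable $\omega$ sets, and let $\Lambda$ be the
set of points in infinitely many $M_j$'s. If $\sum_j P\{M_j\} < \infty$, then $P\{\Lambda\} = 0$.
Conversely, if $\sum_j P\{M_j\} = \infty$ and if the $M_j$'s are mutually independent, then
$P\{\Lambda\} = 1$.

If $\sum_j P\{M_j\} < \infty$, then

(1.3)    $P\{\Lambda\} \le P\{\bigcup_{j=n}^{\infty} M_j\} \le \sum_{j=n}^{\infty} P\{M_j\} \to 0 \qquad (n \to \infty),$

so that $P\{\Lambda\} = 0$. Now suppose that the $M_j$ events are independent.
Since the contrary of infinitely many events occurring is that none occurs
after the $n$th for sufficiently large $n$,

(1.4)    $1 - P\{\Lambda\} = \lim_{n \to \infty} P\{\bigcap_{j=n}^{\infty} M_j'\} = \lim_{n \to \infty} \prod_{j=n}^{\infty} [1 - P\{M_j\}],$

where $M_j'$ is the complement of $M_j$. If $\sum_j P\{M_j\} = \infty$, the infinite
product

$\prod_j [1 - P\{M_j\}]$

must diverge to 0, which implies that the limit on the right in (1.4) is 0,
so that $P\{\Lambda\} = 1$.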
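The two cases of the lemma can be illustrated with exact finite products for independent events. The choices $P\{M_j\} = 1/j$ (divergent sum) and $P\{M_j\} = 1/j^2$ (convergent sum) are assumptions made for the illustration, and the function names are hypothetical.

```python
from fractions import Fraction


def prob_none_occur(p, n, m):
    """P{no M_j occurs for n <= j <= m}, independent events, P{M_j} = p(j)."""
    result = Fraction(1)
    for j in range(n, m + 1):
        result *= 1 - p(j)
    return result


def harmonic(j):
    """P{M_j} = 1/j: the sum of probabilities diverges."""
    return Fraction(1, j)


def summable(j):
    """P{M_j} = 1/j^2: the sum of probabilities converges."""
    return Fraction(1, j * j)
```

With `harmonic`, the product from `n` to `m` telescopes to `(n - 1) / m`, which tends to 0 as `m` grows, so some event after the `n`th occurs with probability 1 for every `n`. With `summable`, the product stays bounded away from 0, in line with the first half of the lemma.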

An extension of the zero-one law and the Borel-Cantelli lemma, due 
to Lévy, will be discussed in VII. 

A related result is the following. Let $M_1, M_2, \ldots$ be independent $\omega$
sets in the above sense. Let $N$ be the expected number of $M_j$ events that
occur, that is, $N$ is the expectation of a random variable which at each $\omega$
is defined as the number of $M_j$'s containing $\omega$,

$N = \sum_j P\{M_j\}.$

We wish to compare $N$ with the probability $P$ that at least one $M_j$ event
occurs,

$P = P\{\bigcup_j M_j\}.$

According to the following inequality $N$ and $P$ approach 0 together. In
fact, since $p_j = P\{M_j\} \le P$,

$P \le N = \sum_j p_j = -\sum_j [\log (1 - p_j) + O(p_j^2)]$

$\le -\log \prod_j (1 - p_j) + N\, O(\text{L.U.B.}_j\ p_j)$

$\le -\log (1 - P) + N\, O(P) = P + O(P^2) + N\, O(P),$

so that

$P \le N \le P + O(P^2) = O(P).$
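A quick numerical sketch of this comparison, for independent events with made-up small probabilities: since $p \le -\log(1 - p)$ and $\prod_j (1 - p_j) = 1 - P$ under independence, one has $P \le N \le -\log(1 - P)$. The function name `compare` is hypothetical.

```python
import math


def compare(probabilities):
    """Return (P, N, bound) for independent events with the given P{M_j}."""
    none = 1.0
    for p in probabilities:
        none *= 1.0 - p                 # probability that no event occurs
    P = 1.0 - none                      # probability that at least one occurs
    N = sum(probabilities)              # expected number of events that occur
    bound = -math.log(1.0 - P)          # upper bound for N under independence
    return P, N, bound


P, N, bound = compare([0.01, 0.02, 0.005])
```

For these inputs the three numbers are close together, which is the sense in which $N$ and $P$ approach 0 together.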




2. Series 

The key to the study of sums of mutually independent random variables 
is the fact that, roughly speaking, partial sums cannot be large unless the 
total sum is. The following two theorems both express this fact. 


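THEOREM 2.1  Let $y_1, \ldots, y_n$ be random variables with

$E\{y_j\} = 0, \qquad E\{y_j y_k\} = 0, \quad j \ne k,$
$\phantom{E\{y_j\} = 0, \qquad E\{y_j y_k\}} = \sigma_j^2, \quad j = k,$

$x_j = y_1 + \cdots + y_j, \qquad E\{x_n^2\} = \sigma_1^2 + \cdots + \sigma_n^2 = \sigma^2.$

Then, if $\varepsilon > 0$,

(2.1)    $P\{|x_n(\omega)| \ge \varepsilon\} \le \frac{\sigma^2}{\varepsilon^2},$

and if in addition

(2.2)    $E\{y_j \mid y_1, \ldots, y_{j-1}\} = 0, \qquad j > 1,$

with probability 1, then

(2.1')    $P\{\max_{j \le n} |x_j(\omega)| \ge \varepsilon\} \le \frac{\sigma^2}{\varepsilon^2}.$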


The first part of the theorem-is simply an application of Chebyshev’s 
inequality, which requires no proof here. The extra hypothesis (2.2), 
which is that the a, process is a martingale (see II §7), is certainly satisfied 
if the y,’s are mutually independent, and Kolmogorov’s proof of (2.1’) in 
the latter case goes through without change in the general case, as follows. 
Let |a,(w)| be the first |a,(@)|, if any, which is > e. Then 


23) 62=fe2dP> > [att Ne, ze + (En a) dP. 
a I=) ((o)=3) 
Now if z; is defined by 
zw) = 2x,(w), xo) =j, 
=0, Wo) Æj, 


the integral of the second term in the bracket above becomes E{z,(x,— 2,)} 
and, since z; depends only on y; * * *, Yj 


Efe, — =} = E[Ee(e,— z) Lyn" +93] 


= Efe Eys ERDAS 93) =0, 


106 MUTUALLY INDEPENDENT RANDOM VARIABLES 1 
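by (2.2). Hence

$$\sigma^2 \ge \sum_{j=1}^{n} \int_{\{\nu(\omega)=j\}} x_j^2\,dP \ge \varepsilon^2 \sum_{j=1}^{n} P\{\nu(\omega) = j\} = \varepsilon^2 P\{\max_{j \le n} |x_j(\omega)| \ge \varepsilon\},$$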

giving the desired inequality. See VII, Theorem 3.2, for a generalization 
of this result. 
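Inequality (2.1′) can be checked exactly in the simplest case. The sketch below is an illustration only, under the assumption (not made in the text) that the $y_j$ are independent fair ±1 variables, so that $\sigma^2 = n$; the function names are hypothetical, and `eps` is taken to be a positive integer so that exact rational arithmetic can be used.

```python
from fractions import Fraction
from itertools import product

def max_abs_partial_sum_prob(n, eps):
    """Exact P{max_{j<=n} |x_j| >= eps} for x_j = y_1 + ... + y_j,
    where the y_i are independent fair +-1 steps (enumerates all 2**n paths)."""
    hits = 0
    for signs in product((-1, 1), repeat=n):
        s = 0
        for y in signs:
            s += y
            if abs(s) >= eps:
                hits += 1
                break
    return Fraction(hits, 2 ** n)

def kolmogorov_bound(n, eps):
    """sigma^2 / eps^2 with sigma^2 = n, since each step has variance 1."""
    return Fraction(n, eps * eps)
```

With n = 2 and eps = 2 the two sides coincide (both are 1/2), so the bound cannot be improved in general without further hypotheses.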

THEOREM 2.2 Let $y_1, \cdots, y_n$ be mutually independent random variables, and let $x_j = y_1 + \cdots + y_j$. Then, if $x_n - x_1, x_n - x_2, \cdots$ have symmetric distributions,

(2.4) $$2P\{x_n(\omega) \ge \lambda + 2\varepsilon\} - 2\sum_{j=1}^{n} P\{y_j(\omega) \ge \varepsilon\} \le P\{\max_{j \le n} x_j(\omega) \ge \lambda\} \le 2P\{x_n(\omega) \ge \lambda\},$$

for every λ > 0, and every ε > 0. The right-hand half of (2.4) remains valid if each $x_n - x_j$ is supposed to have zero median, but not necessarily to be symmetrically distributed.


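Clearly

(2.5) $$P\{\max_{j \le n} x_j(\omega) \ge \lambda,\ x_n(\omega) \ge \lambda\} = P\{x_n(\omega) \ge \lambda\}.$$

On the other hand, using the hypotheses of symmetry and independence, if $x_\nu(\omega)$ is the first $x_j(\omega)$, if any, which is $\ge \lambda$,

(2.6) $$\begin{aligned}
&P\{\max_{j \le n} x_j(\omega) \ge \lambda,\ x_n(\omega) < \lambda\} \\
&\qquad = \sum_{k=1}^{n-1} P\{\nu(\omega) = k,\ x_n(\omega) < \lambda\} \\
&\qquad \le \sum_{k=1}^{n-1} P\{\nu(\omega) = k,\ x_n(\omega) - x_k(\omega) < 0\} \\
&\qquad = \sum_{k=1}^{n-1} P\{\nu(\omega) = k\}\,P\{x_n(\omega) - x_k(\omega) < 0\} \\
&\qquad = \sum_{k=1}^{n-1} P\{\nu(\omega) = k,\ x_n(\omega) - x_k(\omega) > 0\} \\
&\qquad \le \sum_{k=1}^{n-1} P\{\nu(\omega) = k,\ x_n(\omega) \ge \lambda\} \\
&\qquad \le P\{x_n(\omega) \ge \lambda\}.
\end{aligned}$$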




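Add (2.5) to (2.6) to get the right-hand half of (2.4). Note that even if $x_n - x_k$ is not symmetrically distributed, but if at least

$$P\{x_n(\omega) - x_k(\omega) \ge 0\} \ge \tfrac{1}{2}, \qquad k = 1, \cdots, n-1,$$

so that

$$P\{x_n(\omega) - x_k(\omega) \ge 0\} \ge P\{x_n(\omega) - x_k(\omega) < 0\}, \qquad k = 1, \cdots, n-1,$$

then (2.6) remains true except that in the fifth line the (first) equality sign must be replaced by "≤" and the ">" must be replaced by "≥". In particular, if every difference $x_n - x_k$ has zero median, (2.6), and therefore also the right-hand half of (2.4), remains true. To derive the other half of (2.4), note that, if λ, ε > 0, and if each $x_n - x_k$ is symmetrically distributed, then

(2.7) $$\begin{aligned}
&P\{\max_{j \le n} x_j(\omega) \ge \lambda,\ x_n(\omega) < \lambda\} \\
&\qquad = \sum_{k=1}^{n-1} P\{\nu(\omega) = k,\ x_n(\omega) < \lambda\} \\
&\qquad \ge \sum_{k=1}^{n-1} P\{\nu(\omega) = k,\ x_n(\omega) - x_k(\omega) < -\varepsilon\} - \sum_{k=1}^{n-1} P\{y_k(\omega) \ge \varepsilon\} \\
&\qquad = \sum_{k=1}^{n-1} P\{\nu(\omega) = k,\ x_n(\omega) - x_k(\omega) > \varepsilon\} - \sum_{k=1}^{n-1} P\{y_k(\omega) \ge \varepsilon\} \\
&\qquad \ge \sum_{k=1}^{n-1} P\{\nu(\omega) = k,\ x_n(\omega) \ge \lambda + 2\varepsilon\} - 2\sum_{k=1}^{n-1} P\{y_k(\omega) \ge \varepsilon\} \\
&\qquad \ge P\{x_n(\omega) \ge \lambda + 2\varepsilon\} - P\{y_n(\omega) \ge \varepsilon\} - 2\sum_{k=1}^{n-1} P\{y_k(\omega) \ge \varepsilon\}.
\end{aligned}$$

Add (2.5) to (2.7) to get the left-hand half of (2.4).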

The right-hand half of (2.4) is the most important one; the left-hand half will be used in this book only in VIII. There is one important case in which a more precise evaluation can be obtained. If the $y_j$'s only take on the values ±1, with probability 1/2 for each, and if N is a positive integer,

(2.8) $$P\{\max_{j \le n} x_j(\omega) \ge N\} = 2P\{x_n(\omega) \ge N\} - P\{x_n(\omega) = N\}.$$
jan 


This can be seen by the appropriate modification of (2.6), summarized as follows (reflection principle of D. André): if $\max_{j \le n} x_j(\omega) \ge N$, there is a first $x_j(\omega)$, say $x_\nu(\omega)$, which reaches N. From there on, any succession of $x_j(\omega)$ values culminating in an $x_n(\omega) < N$ has the same probability as the succession reflected in the line $x = N$, which culminates in an $x_n(\omega) > N$; that is,

$$P\{\max_{j \le n} x_j(\omega) \ge N,\ x_n(\omega) < N\} = P\{\max_{j \le n} x_j(\omega) \ge N,\ x_n(\omega) > N\} \quad (= P\{x_n(\omega) > N\})$$

and this, combined with (2.5), gives (2.8). This evaluation is used in the problem of ruin. The method is also applicable of course if the $y_j$'s take on only the values ±ε (instead of ±1), and from this it is but a step to get a precise evaluation in the limiting case when the parameter j is continuous, when $x_j$ is replaced by $x_t$ and the $x_t$ process is the Brownian movement process. The essential point is that there must be a first value of the parameter at which the value λ is actually attained. This is true if the $x_j(\omega)$'s only change in integral multiples of a given number ε, as above, choosing λ as a multiple of ε, or, if the parameter t is continuous, it is true if the sample functions are continuous functions of t (cf. VIII §2).
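The evaluation (2.8) and the right-hand half of (2.4) can be confirmed by direct enumeration for short walks. This is a minimal sketch under the hypotheses of (2.8) (independent fair ±1 steps, N a positive integer); the function name is hypothetical and the enumeration is only practical for small n.

```python
from fractions import Fraction
from itertools import product

def walk_probabilities(n, N):
    """Exact (P{max_{j<=n} x_j >= N}, P{x_n >= N}, P{x_n == N})
    for the partial sums x_j of n independent fair +-1 steps."""
    max_hits = end_ge = end_eq = 0
    for signs in product((-1, 1), repeat=n):
        s = 0
        running_max = None
        for y in signs:
            s += y
            running_max = s if running_max is None else max(running_max, s)
        max_hits += running_max >= N   # the path reaches level N at some time
        end_ge += s >= N               # the path ends at or above N
        end_eq += s == N               # the path ends exactly at N
    total = 2 ** n
    return (Fraction(max_hits, total),
            Fraction(end_ge, total),
            Fraction(end_eq, total))
```

For every n and every positive integer N the first returned value equals twice the second minus the third, which is (2.8), and it never exceeds twice the second, which is the right-hand half of (2.4).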


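THEOREM 2.3 Let $y_1, y_2, \cdots$ be mutually independent random variables, with variances $\sigma_1^2, \sigma_2^2, \cdots$. Then, if $\sum_1^\infty \sigma_j^2 = \sigma^2 < \infty$ and if $\sum_1^\infty E\{y_j\}$ converges, $\sum_1^\infty y_j = x$ is convergent with probability 1 and also in the mean. Moreover

(2.9) $$E\{x\} = \sum_{1}^{\infty} E\{y_j\}, \qquad E\{x^2\} - E\{x\}^2 = \sigma^2,$$

and if $x_n = \sum_{1}^{n} [y_j - E\{y_j\}]$,

(2.10) $$P\{\text{L.U.B.}_n\ |x_n(\omega)| \ge \varepsilon\} \le \frac{\sigma^2}{\varepsilon^2}.$$

Conversely, if $\text{l.i.m.}_{n \to \infty} \sum_1^n y_j = x$ exists, the series $\sum_1^\infty \sigma_j^2$ and $\sum_1^\infty E\{y_j\}$ converge.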


By Theorem 2.1, if m is fixed,

(2.11) $$P\{\max_{m < n \le m+r} |x_n(\omega) - x_m(\omega)| \ge \varepsilon\} \le \frac{1}{\varepsilon^2} \sum_{j=m+1}^{m+r} \sigma_j^2 \le \frac{1}{\varepsilon^2} \sum_{j=m+1}^{\infty} \sigma_j^2.$$

When r → ∞ this becomes, if ε is a point of continuity of the distribution function of $\text{L.U.B.}_{n>m} |x_n - x_m|$,

(2.12) $$P\{\text{L.U.B.}_{n>m}\ |x_n(\omega) - x_m(\omega)| \ge \varepsilon\} \le \frac{1}{\varepsilon^2} \sum_{j=m+1}^{\infty} \sigma_j^2;$$

this inequality is then true for all ε > 0, by a continuity argument. Hence (ε → ∞) the inferior and superior limits of the sequence $\{x_n\}$ are finite, with probability 1, and

(2.13) $$P\{\limsup_{n \to \infty} x_n(\omega) - \liminf_{n \to \infty} x_n(\omega) \ge 2\varepsilon\} \le \frac{1}{\varepsilon^2} \sum_{j=m+1}^{\infty} \sigma_j^2.$$

Thus (m → ∞) the inferior and superior limits differ by at most 2ε, with probability 1, for every ε > 0 and are consequently equal with probability 1; that is to say, the series $\sum_1^\infty [y_j - E\{y_j\}]$ converges with probability 1, as does $\sum_1^\infty y_j$, since $\sum_1^\infty E\{y_j\}$ converges. Inequality (2.10) is a special case of (2.12) with m = 0, defining $x_0 = 0$. There is convergence in the mean of $\sum_1^\infty [y_j - E\{y_j\}]$ since

$$\lim_{m, n \to \infty} E\{|x_n - x_m|^2\} = \lim_{m, n \to \infty} \sum_{j=m+1}^{n} \sigma_j^2 = 0,$$

and this together with the convergence of $\sum_1^\infty E\{y_j\}$ implies the convergence in the mean of $\sum_1^\infty y_j$. It is a general property of convergence in the mean that $\text{l.i.m.}_{n \to \infty} z_n = z$ implies that $\lim_{n \to \infty} E\{z_n\} = E\{z\}$ and $\lim_{n \to \infty} E\{z_n^2\} = E\{z^2\}$, and this general property gives (2.9). Conversely, if $\sum_1^\infty y_n = x$, where the series converges in the mean, $\sum_1^\infty E\{y_n\}$ converges to $E\{x\}$ in accordance with this same property. Hence $\sum_1^\infty [y_n - E\{y_n\}]$ also converges in the mean, that is, $\text{l.i.m.}_{n \to \infty} x_n$ exists. Then the evaluation just given of $E\{|x_n - x_m|^2\}$ shows that $\sum_1^\infty \sigma_j^2 < \infty$.

Since the differences $\{y_j - E\{y_j\}\}$ are mutually orthogonal, this theorem can be considered a theorem on a very special type of orthogonal series; it is the strict sense version of IV, Theorem 4.1.
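As an illustration of Theorem 2.3 and of the bound (2.10), consider the series $\sum \pm 1/n$ with independent fair signs, for which $\sum \sigma_n^2 = \sum 1/n^2 < \infty$. The sketch below is a seeded Monte Carlo experiment, not a proof: the example series, the truncation at 200 terms, the number of trials, and the function names are all assumptions made here for illustration.

```python
import random

def sup_abs_partial_sum(num_terms, rng):
    """Largest |x_n|, n <= num_terms, for x_n = sum_{k<=n} (+-1)/k with fair signs."""
    s, sup = 0.0, 0.0
    for n in range(1, num_terms + 1):
        s += rng.choice((-1.0, 1.0)) / n
        sup = max(sup, abs(s))
    return sup

def empirical_tail(eps, num_terms=200, trials=2000, seed=0):
    """Fraction of simulated paths whose partial sums reach |x_n| >= eps."""
    rng = random.Random(seed)
    hits = sum(sup_abs_partial_sum(num_terms, rng) >= eps for _ in range(trials))
    return hits / trials

def variance_bound(eps, num_terms=200):
    """Right side of (2.10) for the truncated series: (sum of variances) / eps^2."""
    return sum(1.0 / n ** 2 for n in range(1, num_terms + 1)) / eps ** 2
```

The observed frequency should lie well below the bound, since (2.10) is a Chebyshev-type estimate and is far from sharp for this series.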
Theorem 2.1 can be interpreted to state that convergence in the mean of $\sum_1^\infty y_j$ implies convergence with probability 1. The basic fact about series of mutually independent random variables is that almost any limitation on the spread of the partial sums, such as convergence in the mean, implies convergence with probability 1. Before elaborating on this point we note that the set of sample sequences of the $x_j$'s for which there is convergence is defined by conditions on the $y_j$'s for large j. In other words, the zero-one law (Theorem 1.1) is applicable, and states that there is convergence (to a finite limit) with either probability 1 or probability 0.

Let $y_1, y_2, \cdots$ be a sequence of mutually independent random variables. If there are constants $c_1, c_2, \cdots$ such that $\sum_1^\infty (y_n - c_n)$ converges with probability 1, the series $\sum_1^\infty y_n$ will be said to converge with probability 1 when centered, and $c_1, c_2, \cdots$ will be called centering constants. If $c_1', c_2', \cdots$ is another sequence of constants, the $c_j'$'s will be centering constants if and only if $\sum_1^\infty (c_n - c_n')$ converges. If $\sum_1^\infty y_n$ converges with probability 1 when centered and if there are centering constants $\hat c_1, \hat c_2, \cdots$ with the property that $\sum_1^\infty (y_n - \hat c_n)$ converges with probability 1, for any ordering of the terms in the sum, $\hat c_1, \hat c_2, \cdots$ will be called absolute centering constants. If $\hat c_1', \hat c_2', \cdots$ is another sequence of constants, the $\hat c_j'$'s will be absolute centering constants if and only if $\sum_1^\infty |\hat c_n - \hat c_n'|$ converges (since $\sum_1^\infty (\hat c_n - \hat c_n')$ must converge for any ordering of the terms). It will be shown in Theorem 2.6 that when there are centering constants there are always absolute centering constants (and that the sums using these constants are independent of the order of summation). As an example suppose that $y_n$ has finite variance $\sigma_n^2$, and that $\sum_1^\infty \sigma_n^2 < \infty$. Then according to Theorem 2.3 $E\{y_1\}, E\{y_2\}, \cdots$ are appropriate centering constants. Since $\sum_1^\infty \sigma_n^2$ converges with any ordering of the terms, the same is true of

$$\sum_{1}^{\infty} [y_n - E\{y_n\}].$$

This means that $E\{y_1\}, E\{y_2\}, \cdots$ are absolute centering constants. On the other hand, 0, 0, ⋯ are centering constants for the series if and only if $\sum_1^\infty E\{y_n\}$ converges, and are absolute centering constants if and only if $\sum_1^\infty |E\{y_n\}|$ converges.

Even if the variances are not finite, appropriate centering constants can always be written down explicitly. For example, an evaluation of the nth centering constant given in Theorem 2.6 makes it depend only on the nth summand.

The following theorem is a weakened converse to Theorem 2.3 in so far as the latter pertained to probability 1 convergence.

THEOREM 2.4 Let $y_1, y_2, \cdots$ be mutually independent random variables which are uniformly bounded, $|y_n| \le c$, with variances $\sigma_1^2, \sigma_2^2, \cdots$. Then, if $\sum_1^\infty y_n$ converges with probability 1, the series $\sum_1^\infty \sigma_n^2$ and $\sum_1^\infty E\{y_n\}$ converge.

Let $\Phi_n$ and $\Phi$ be, respectively, the characteristic functions of $y_n$ and $\sum_1^\infty y_n = x$. Since the distribution of $\sum_1^n y_j$ converges to that of x, $\prod_1^n \Phi_j(t) \to \Phi(t)$ uniformly in every finite t interval. Then, applying I (11.14),

$$\sum_{1}^{\infty} \sigma_n^2 \le -\frac{3}{t^2} \log |\Phi(t)|, \qquad |t| \le \frac{1}{c},$$

and the right side is finite if t is sufficiently small. Then by Theorem 2.3 $\sum_1^\infty [y_n - E\{y_n\}]$ converges with probability 1. Since $\sum_1^\infty y_n$ also converges with probability 1, $\sum_1^\infty E\{y_n\}$ must converge.


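THEOREM 2.5 (Three series theorem) Let $y_1, y_2, \cdots$ be mutually independent random variables, and let $\{a_n\}$, $\{b_n\}$ be sequences of numbers, with

(2.14) $$0 < \liminf_{n \to \infty} \min\{a_n, b_n\} \le \limsup_{n \to \infty} \max\{a_n, b_n\} < \infty.$$

Define $y_n'$ by

$$y_n'(\omega) = \begin{cases} y_n(\omega), & -a_n \le y_n(\omega) \le b_n \\ 0, & \text{otherwise.} \end{cases}$$

Then $\sum_1^\infty y_n$ converges with probability 1 if and only if the series

(2.15) $$\sum_{1}^{\infty} P\{y_n'(\omega) \ne y_n(\omega)\}, \qquad \sum_{1}^{\infty} E\{y_n'\}, \qquad \sum_{1}^{\infty} E\{[y_n' - E\{y_n'\}]^2\}$$

converge.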


If $\sum_1^\infty y_n$ converges, $\lim_{n \to \infty} y_n = 0$, so that $y_n(\omega) = y_n'(\omega)$ for large n. If these statements are true with probability 1, the first series in (2.15) converges, by the Borel-Cantelli lemma, Theorem 1.2, and the convergence of the other two series in (2.15) is proved by applying Theorem 2.4 to the series $\sum_1^\infty y_n'$, which also converges with probability 1. Conversely, suppose that the three series in (2.15) converge. Then, by Theorem 1.2, $y_n(\omega) = y_n'(\omega)$ for large n, with probability 1, and the convergence of $\sum_1^\infty y_n'$ with probability 1, which is implied by the convergence of means and variances, implies that of $\sum_1^\infty y_n$.
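The three series in (2.15) are easy to evaluate for discrete summands. The sketch below uses a hypothetical example chosen here for illustration, not taken from the text: $y_n = \pm n$ with probability $1/(2n^2)$ each and $y_n = 0$ otherwise, truncated at $a_n = b_n = 1/2$. Every $y_n$ has variance 1, so Theorem 2.3 does not apply, yet all three series converge and the theorem gives convergence with probability 1.

```python
from fractions import Fraction

def example_distribution(n):
    """Hypothetical summand: +n or -n with probability 1/(2 n^2) each, else 0."""
    p = Fraction(1, 2 * n * n)
    return [(n, p), (-n, p), (0, 1 - 2 * p)]

def truncated_stats(values_probs, a, b):
    """For a discrete y, return (P{y' != y}, E{y'}, Var{y'}),
    where y' = y if -a <= y <= b and y' = 0 otherwise."""
    p_changed = sum(p for v, p in values_probs if not (-a <= v <= b))
    mean = sum(v * p for v, p in values_probs if -a <= v <= b)
    second = sum(v * v * p for v, p in values_probs if -a <= v <= b)
    return p_changed, mean, second - mean * mean

def three_series(num_terms, a=Fraction(1, 2), b=Fraction(1, 2)):
    """Partial sums, over n = 1..num_terms, of the three series in (2.15)."""
    s1 = s2 = s3 = Fraction(0)
    for n in range(1, num_terms + 1):
        p_changed, mean, var = truncated_stats(example_distribution(n), a, b)
        s1 += p_changed
        s2 += mean
        s3 += var
    return s1, s2, s3

def raw_variance(n):
    """Variance of the untruncated summand y_n."""
    dist = example_distribution(n)
    mean = sum(v * p for v, p in dist)
    return sum(v * v * p for v, p in dist) - mean * mean
```

Here the first series is $\sum 1/n^2$ and the other two vanish identically, so only finitely many summands are nonzero, with probability 1.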
If $\sum_1^\infty y_n$ is a series of mutually independent random variables which converges with probability 1 when centered, it is not obvious that a series of the same summands in a different order also converges with probability 1 when centered. This implication is correct, however, and can be seen in various ways. For example, we shall show (Theorem 2.7) that the series $\sum_1^\infty y_n$ converges with probability 1 when centered if and only if the infinite product whose factors are the absolute values of the characteristic functions of the $y_n$'s is everywhere convergent. Since this convergence is independent of the order of the factors, the stated property of the series $\sum_1^\infty y_n$ is independent of the order of summation. Another way of proving this same fact is to show that there is a set of absolute centering constants whenever there is a set of centering constants. This fact is contained in the following theorem.
THEOREM 2.6 Let $y_1, y_2, \cdots$ be mutually independent random variables and suppose that $\sum_1^\infty y_n$ converges with probability 1, when centered.

(i) There are always absolute centering constants, $\hat c_1, \hat c_2, \cdots$. For example, if $m_n$ is a median value of $y_n$, if α > 0, and if

$$y_n'(\omega) = \begin{cases} y_n(\omega), & |y_n(\omega) - m_n| \le \alpha \\ m_n, & |y_n(\omega) - m_n| > \alpha, \end{cases}$$

then we can take $\hat c_n = E\{y_n'\}$. If $y_1, y_2, \cdots$ are symmetrically distributed, 0, 0, ⋯ are absolute centering constants.

(ii) If $\hat c_1, \hat c_2, \cdots$ are absolute centering constants, the sum $\sum_1^\infty (y_n - \hat c_n)$ is independent of the order of the terms, neglecting zero probabilities, and any subseries $\sum_1^\infty y_{n_j}$ has $\hat c_{n_1}, \hat c_{n_2}, \cdots$ as absolute centering constants.
Proof of (i) Let $c_1, c_2, \cdots$ be centering constants for the $y_n$'s, so that $\sum_1^\infty (y_n - c_n)$ converges with probability 1. Then $\lim_{n \to \infty} (m_n - c_n) = 0$. The definition of $y_n'$ amounts to cutting off $(y_n - c_n)$ α units below and above its median value $m_n - c_n$, that is, for n large very nearly α units below and above 0. This is a legitimate cutoff of $y_n - c_n$ in accordance with the conditions of Theorem 2.5, and yields the truncated expectation $E\{y_n'\} - c_n$. Hence by Theorem 2.5 $\sum_1^\infty [E\{y_n'\} - c_n]$ converges. Therefore $E\{y_1'\}, E\{y_2'\}, \cdots$ are centering constants for the series, and $\sum_1^\infty [y_n - E\{y_n'\}]$ converges with probability 1. We shall use below the fact, implied by this, that $m_n - E\{y_n'\} \to 0$. Now define $\hat y_n = y_n - E\{y_n'\}$ so that $\sum_1^\infty \hat y_n$ converges with probability 1, and apply Theorem 2.5 to this series, truncating $\hat y_n$ outside the closed interval $[m_n - E\{y_n'\} - \alpha,\ m_n - E\{y_n'\} + \alpha]$ to get $\hat y_n' = y_n' - E\{y_n'\}$. This truncation satisfies the conditions on $a_n$, $b_n$ in Theorem 2.5 because $m_n - E\{y_n'\} \to 0$. According to Theorem 2.5,

$$\sum_{1}^{\infty} P\{\hat y_n(\omega) \ne \hat y_n'(\omega)\} < \infty, \qquad \sum_{1}^{\infty} E\{\hat y_n'^2\} < \infty$$

and we have also $E\{\hat y_n'\} = 0$. Since the series written here are absolutely convergent, and since the convergence of these series gives sufficient conditions in Theorem 2.5, the series

$$\sum_{1}^{\infty} \hat y_n = \sum_{1}^{\infty} [y_n - E\{y_n'\}]$$

converges with probability 1, regardless of the order of the terms in the sum. The constants $E\{y_1'\}, E\{y_2'\}, \cdots$ are therefore absolute centering constants. In particular, if the $y_n$ are symmetrically distributed, $E\{y_n'\} = 0$ so that 0, 0, ⋯ are absolute centering constants. Note that, since every subseries of either of the above two series of constants converges absolutely, every subseries of $\sum_1^\infty [y_n - E\{y_n'\}]$ converges with probability 1, regardless of the order of the terms in the sum. Thus each subseries of $\sum_1^\infty y_n$ has the corresponding subsequence of $\{E\{y_n'\}\}$ as sequence of absolute centering constants.


Proof of (ii) If $\hat c_1, \hat c_2, \cdots$ are absolute centering constants, $\sum_1^\infty |\hat c_n - E\{y_n'\}| < \infty$. Any subseries $\sum_1^\infty y_{n_j}$ has $\hat c_{n_1}, \hat c_{n_2}, \cdots$ as absolute centering constants because $\sum_1^\infty |\hat c_{n_j} - E\{y_{n_j}'\}| < \infty$ and because we have seen in (i) that $E\{y_{n_1}'\}, E\{y_{n_2}'\}, \cdots$ are absolute centering constants for the subseries. Finally, to prove that the sum $\sum_1^\infty (y_n - \hat c_n)$ is independent of the order of the terms, it is sufficient to take $\hat c_n = E\{y_n'\}$ because the sum of the absolutely convergent series $\sum_1^\infty [\hat c_n - E\{y_n'\}]$ is independent of the order of the terms. Thus we must prove that, if $n_1, n_2, \cdots$ is any permutation of the natural numbers,

(2.16) $$\sum_{j=1}^{\infty} \hat y_j = \sum_{j=1}^{\infty} \hat y_{n_j},$$

with probability 1. Since the interchange of a finite number of terms of a series does not change the sum, we can modify the above series, if convenient, so that $n_j = j$, $j = 1, \cdots, N$, where N is arbitrary. We prove first that

(2.16′) $$\sum_{j=1}^{\infty} \hat y_j' = \sum_{j=1}^{\infty} \hat y_{n_j}',$$

with probability 1. To prove this it is sufficient to prove that the series have the same limit in the mean sums, which follows from the evaluation (applying (2.9) to the $\hat y_j'$'s)

5 


EUSI S= SAE >O r>. 


1 


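Then (2.16′) is true, and (2.16) follows from

$$\begin{aligned}
P\Big\{\sum_{j=1}^{\infty} \hat y_j(\omega) \ne \sum_{j=1}^{\infty} \hat y_{n_j}(\omega)\Big\} &\le \sum_{j=N+1}^{\infty} P\{\hat y_j(\omega) \ne \hat y_j'(\omega)\} + \sum_{j=N+1}^{\infty} P\{\hat y_{n_j}(\omega) \ne \hat y_{n_j}'(\omega)\} \\
&= 2\sum_{j=N+1}^{\infty} P\{\hat y_j(\omega) \ne \hat y_j'(\omega)\} \to 0 \qquad (N \to \infty),
\end{aligned}$$

where we have assumed, as we have already remarked we can, that the first N of the $n_j$'s have been changed into 1, ⋯, N.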
Note that in proving this theorem we have not proved that the series $\sum_1^\infty (y_n - \hat c_n)$ converges absolutely with probability 1, and in fact this may be false. The following theorem distinguishes between the different possibilities in terms of the characteristic functions of the $y_n$'s. The conditions are stated in terms of certain infinite products. We recall that, if an infinite product is convergent, its value is 0 if and only if one of the factors vanishes. This means, for example, that if $\{\Phi_n\}$ is a sequence of characteristic functions, the infinite product $\prod_1^\infty |\Phi_n(t)|$ has a value for every t, even though the product may diverge for some values of t. The value of the product will be 0 at all points of divergence, and at any further points where one of the factors vanishes.

THEOREM 2.7 Let $y_1, y_2, \cdots$ be mutually independent random variables, with characteristic functions $\Phi_1, \Phi_2, \cdots$.

(i) If $\sum_1^\infty y_n$ converges with probability 1 when centered, then $\prod_1^\infty |\Phi_n(t)|$ is continuous, is 1 when t = 0, and this infinite product converges uniformly in every finite t interval. Conversely, if $\prod_1^\infty |\Phi_n(t)| > 0$ on a t set of positive Lebesgue measure (or slightly more generally if this infinite product converges on a t set of positive Lebesgue measure), the series $\sum_1^\infty y_n$ converges with probability 1 when centered.

(ii) If $\sum_1^\infty y_n$ converges with probability 1 (that is if 0, 0, ⋯ are appropriate centering constants), $\prod_1^\infty \Phi_n$ converges uniformly in every finite t interval. Conversely, if this infinite product converges on a t set of positive Lebesgue measure, $\sum_1^\infty y_n$ converges with probability 1.

(iii) If $\sum_1^\infty y_n$ converges with probability 1 regardless of the order of summation (that is, if 0, 0, ⋯ are appropriate absolute centering constants), $\sum_1^\infty |\Phi_n - 1|$ converges uniformly in every finite t interval and the Weierstrass M test criterion is applicable. Conversely, if $\sum_1^\infty |\Phi_n - 1|$ converges on a t set of positive Lebesgue measure, $\sum_1^\infty y_n$ converges with probability 1, regardless of the order of summation.

Proof of (i) If $\sum_1^\infty y_n$ converges with probability 1 when centered, let $c_1, c_2, \cdots$ be centering constants. Then $\sum_1^\infty (y_n - c_n)$ converges with probability 1, so that

$$\lim_{N \to \infty} \prod_{n=1}^{N} \Phi_n(t) e^{-itc_n}$$

exists uniformly in every finite interval, and defines the characteristic function of $\sum_1^\infty (y_n - c_n)$. Then $\lim_{N \to \infty} \prod_1^N |\Phi_n|$ exists uniformly in every finite interval, and $\prod_1^\infty |\Phi_n|$ is continuous, and is 1 when t = 0. Since

$$\lim_{N \to \infty} \sum_{n=N}^{\infty} (y_n - c_n) = 0,$$

with probability 1, it follows that

$$\lim_{N \to \infty} \prod_{n=N}^{\infty} |\Phi_n| = 1$$

uniformly in every finite t interval, so that the infinite product $\prod_1^\infty |\Phi_n|$ converges uniformly in every finite t interval. Conversely, suppose that the t-set $A_1$, where $\prod_1^\infty |\Phi_n(t)| > 0$, has positive Lebesgue measure. Since $\Phi_n(-t) = \overline{\Phi_n(t)}$, $A_1$ is symmetric in t = 0. There is then a bounded set A of Lebesgue measure ρ > 0, and a positive number a such that

$$\text{G.L.B.}_{t \in A} \prod_{n=1}^{\infty} |\Phi_n(t)| > 0, \qquad A \subset [0, a].$$

Suppose that α > 0, and define $y_n'$ by

$$y_n'(\omega) = \begin{cases} y_n(\omega), & |y_n(\omega) - m_n| \le \alpha \\ m_n, & |y_n(\omega) - m_n| > \alpha, \end{cases}$$

where $m_n$ is a median value of $y_n$. Then, by I (11.8′),
È PO) yO) = $ Pilyo) — m| > 2} 
<— 4L,(a, p, a) f log TT |®,(2)| dt < œ, 
A 1 
and, by I (1 1.9), if y„ has variance o,?, 
> 0,2 <— 2Li(% p, d) | log TT |, (| dt < 0. 
A 1 


The three series theorem (Theorem 2.5) then states that the series $\sum_1^\infty [y_n - E\{y_n'\}]$ converges with probability 1, that is, the series $\sum_1^\infty y_n$ converges with probability 1 when centered, as was to be proved. More generally, suppose that the infinite product $\prod_1^\infty |\Phi_n(t)|$ is convergent (rather than divergent to 0) on a t set A of positive Lebesgue measure. If there is such a set A,

$$\lim_{\nu \to \infty} \prod_{n=\nu}^{\infty} |\Phi_n(t)| = 1, \qquad t \in A,$$

so that there is a ν for which $\prod_{n=\nu}^{\infty} |\Phi_n(t)| > 0$ on a t set of positive Lebesgue measure. But then, according to what we have just proved, $\sum_{n=\nu}^{\infty} y_n$ converges with probability 1 when centered, so the same must be true of $\sum_1^\infty y_n$.
Proof of (ii) If $\sum_1^\infty y_n$ converges with probability 1, $\lim_{N \to \infty} \prod_1^N \Phi_n$ exists uniformly in every finite interval (and is the characteristic function of $\sum_1^\infty y_n$). Moreover, since $\lim_{N \to \infty} \sum_{n=N}^{\infty} y_n = 0$ with probability 1, it follows that $\lim_{N \to \infty} \prod_{n=N}^{\infty} \Phi_n = 1$ uniformly in every finite t interval. Then the infinite product $\prod_1^\infty \Phi_n$ is uniformly convergent in every finite t interval. Conversely, if this product converges on a t set A of positive Lebesgue measure, the same is true of $\prod_1^\infty |\Phi_n|$. There is then, according to (i), a sequence of constants $\{c_n\}$ such that $\sum_1^\infty (y_n - c_n)$ converges with probability 1. It follows from what we have just proved that $\prod_1^\infty \Phi_n(t) e^{-itc_n}$ is a convergent product for all t. On the other hand, $\prod_1^\infty \Phi_n(t)$ converges, by hypothesis, when $t \in A$. These two facts taken together imply that $\sum_1^\infty c_n$ converges. Then $\sum_1^\infty y_n$ converges with probability 1, as was to be proved.




Proof of (iii) If $\sum_1^\infty y_n$ converges with probability 1, define $y_1', y_2', \cdots$ as in (i). We have seen that the truncated expectations $E\{y_1'\}, E\{y_2'\}, \cdots$ are always absolute centering constants. If we suppose that 0, 0, ⋯ are also absolute centering constants, it follows that

$$\sum_{1}^{\infty} |E\{y_n'\}| < \infty.$$
We write c,, = Efy,’} and ©,(t) = ®,(t)e-**". Then by, I (11.10), 
|,()— 1] = |®,()— e- = | < |@,()— 1| + et 1 


M,(T)+T\c,|, — |t| <T, 
where 


M,(T) =— 2L,(T, a, a, a) f log |®,,(s)| ds 
0 
and a is taken so small that lah |®,(0)| > 4 for |t| <a. Since 
1 


$$\sum_{1}^{\infty} [M_n(T) + |c_n|] < \infty,$$

the series $\sum_1^\infty |\Phi_n - 1|$ converges uniformly in every finite interval, and the Weierstrass M test is applicable. Conversely, if $\sum_1^\infty |\Phi_n - 1|$ converges on some t set of positive Lebesgue measure, the infinite product $\prod_1^\infty \Phi_n$ also converges on this t set, so the series $\sum_1^\infty y_n$ converges with probability 1, by (ii). Since $\sum_1^\infty |\Phi_n - 1|$ converges on this set regardless of the order of the summands, $\sum_1^\infty y_n$ converges with probability 1 regardless of the order of the summands, as was to be proved.

The following two corollaries, of which the second is the important 
one, are now easily derived. The first is proved only because it will be 
used in VIII. 

COROLLARY 1 Let $y_1, y_2, \cdots$ be mutually independent random variables. Suppose that $\sum_1^\infty y_j$ converges with probability 1, regardless of the order of summation. Let $A_1, A_2, \cdots$ be disjunct sets of natural numbers, with $A = \bigcup_1^\infty A_n$, and define $X_n = \sum_{j \in A_n} y_j$. Then $X_1, X_2, \cdots$ are mutually independent random variables, and $\sum_1^\infty X_n = \sum_{j \in A} y_j$ with probability 1, where the sums involved converge with probability 1, regardless of the order of summation.

The series for $X_n$ converges with probability 1 by Theorem 2.6 (ii), and $X_1, X_2, \cdots$ are obviously mutually independent. Let $\Phi_n$ be the characteristic function of $y_n$. According to Theorem 2.7 (iii),

$$\sum_{1}^{\infty} |\Phi_n(t) - 1| < \infty$$

for all t. Hence (see the Appendix for a proof of the inequality used here)
> 1 SO= e232 POT 
n=l jedn n=1 jeA, 
for all t, and since the nth product on the left is the characteristic function of $X_n$, the inequality implies, again by Theorem 2.7 (iii), that $\sum_1^\infty X_n$ converges with probability 1 regardless of the order of summation. The difference between this sum and the sum $\sum_{j \in A} y_j$ is evidently independent of $y_1, \cdots, y_n$ for every n, and therefore is identically constant with probability 1, by the zero-one law. The constant must be 0 because the two sums have the same characteristic function $\prod_{j \in A} \Phi_j$. The order of the factors in the latter product does not affect the value of the product, since $\sum_1^\infty |\Phi_j(t) - 1| < \infty$.
COROLLARY 2 Let $y_1, y_2, \cdots$ be mutually independent random variables. Then, if the partial sums of $\sum_1^\infty y_n$ converge in distribution or in probability, the series converges with probability 1.

Since convergence in distribution of the partial sums means that $\prod_1^n \Phi_j$ converges uniformly in every finite t interval (n → ∞), and since convergence in probability implies convergence in distribution, this statement is a special case of part (ii) of the theorem.
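A concrete instance of the product criterion in part (ii) is the series $\sum \pm 2^{-n}$ with independent fair signs. This example is an assumption introduced here, not taken from the text: each summand has characteristic function $\cos(t/2^n)$, and the classical identity $\prod_{n \ge 1} \cos(t/2^n) = \sin t / t$ shows that the product converges for every t, to the characteristic function of the uniform distribution on [−1, 1]. A minimal numerical sketch, with hypothetical function names:

```python
import math

def partial_product(t, num_factors):
    """Product of the characteristic functions cos(t / 2**n), n = 1..num_factors."""
    prod = 1.0
    for n in range(1, num_factors + 1):
        prod *= math.cos(t / 2 ** n)
    return prod

def limit_cf(t):
    """sin(t) / t, the characteristic function of the uniform law on [-1, 1]."""
    return 1.0 if t == 0 else math.sin(t) / t
```

Comparing `partial_product(t, 40)` with `limit_cf(t)` at a few values of t shows agreement to within rounding error.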

The hypotheses of the following theorem seem somewhat artificial at 
first, but they are frequently true (cf. the application of this theorem in 
VIII §6). 

THEOREM 2.8 Let $y_1, y_2, \cdots$ be mutually independent random variables, and suppose that there is a random variable y for which, if $\Delta_n$ is defined by

$$y_1 + \cdots + y_n + \Delta_n = y,$$

$\Delta_n$ is independent of $y_1, \cdots, y_n$. Then $\sum_1^\infty y_j$ converges with probability 1 when centered.

In fact, if $\Phi_j$ is the characteristic function of $y_j$, $\Psi_j$ that of $\Delta_j$, and $\Phi$ that of y,

$$\prod_{j=1}^{n} |\Phi_j(t)| \ge \prod_{j=1}^{n} |\Phi_j(t)|\,|\Psi_n(t)| = |\Phi(t)|.$$

Hence $\prod_1^\infty |\Phi_j(t)| \ge |\Phi(t)| > 0$ for small t, and Theorem 2.7 (i) can be applied to give the stated result.
In discussing the convergence of a series $\sum_1^\infty y_j$ of mutually independent random variables it is sometimes convenient to use a symmetrization procedure. Let $y_1^*, y_2^*, \cdots$ be random variables defined in such a way that $y_j$ and $y_j^*$ have the same distribution and that $y_1, y_1^*, y_2, y_2^*, \cdots$ are mutually independent. (If the given space is not complex enough to support such a sequence $\{y_j^*\}$, the ω space may be adjusted by the adjunction of a space with such a sequence, as described in II §2.) Then the characteristic function of $(y_j - y_j^*)$ is $|\Phi_j|^2$. According to Theorem 2.7 the series $\sum_1^\infty y_j$ converges with probability 1 when centered if and only if $\prod_1^\infty |\Phi_j|$ converges uniformly in every finite t interval, and $\sum_1^\infty (y_j - y_j^*)$ converges with probability 1 if and only if $\prod_1^\infty |\Phi_j|^2$ converges uniformly in every finite t interval. Then $\sum_1^\infty y_j$ converges with probability 1 when centered if and only if $\sum_1^\infty (y_j - y_j^*)$ converges with probability 1. The conditions of Theorem 2.7 for convergence with probability 1 when centered, and for convergence with probability 1, coincide for the series $\sum_1^\infty (y_j - y_j^*)$. The introduction of $y_j^*$ makes it possible to reduce the study of the convergence of $\sum_1^\infty y_j$ when centered to the special case when the characteristic functions of the summands are real and ≥ 0. The fact that the convergence of $\sum_1^\infty (y_j - y_j^*)$ with probability 1 implies the convergence of $\sum_1^\infty y_j$ when centered can also be obtained as a consequence of Theorem 2.8, with

$$y = \sum_{j=1}^{\infty} (y_j - y_j^*), \qquad \Delta_n = -\sum_{j=1}^{n} y_j^* + \sum_{j=n+1}^{\infty} (y_j - y_j^*).$$
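The statement that $y - y^*$ has characteristic function $|\Phi|^2$ can be checked directly for a discrete distribution. The following is a minimal sketch with hypothetical function names; a distribution is represented as a dictionary mapping values to probabilities, which is an illustrative choice rather than anything in the text.

```python
import cmath

def char_fn(dist, t):
    """Characteristic function E[exp(i t y)] of a discrete law {value: probability}."""
    return sum(p * cmath.exp(1j * t * v) for v, p in dist.items())

def symmetrized(dist):
    """Law of y - y*, where y* is an independent copy of y."""
    out = {}
    for v1, p1 in dist.items():
        for v2, p2 in dist.items():
            out[v1 - v2] = out.get(v1 - v2, 0.0) + p1 * p2
    return out
```

For any such law, `char_fn(symmetrized(dist), t)` agrees with `abs(char_fn(dist, t)) ** 2` up to rounding, and in particular is real and non-negative, which is the special case to which the symmetrization reduces the problem.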




THEOREM 2.9 Let $y_1, y_2, \cdots$ be mutually independent random variables, and suppose that for some K > 0

$$\limsup_{n \to \infty} P\Big\{\Big|\sum_{j=1}^{n} y_j(\omega)\Big| \le K\Big\} > 0.$$

Then $\sum_1^\infty y_j$ converges with probability 1 when centered.
il 


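By hypothesis there is an ε > 0 and an increasing sequence $\{n_\nu\}$ of positive integers such that

$$P\Big\{\Big|\sum_{j=1}^{n} y_j(\omega)\Big| \le K\Big\} \ge \varepsilon, \qquad n = n_\nu.$$

Let $y_1^*, y_2^*, \cdots$ be symmetrizing random variables as in the above discussion. Then

$$P\Big\{\Big|\sum_{j=1}^{n} [y_j(\omega) - y_j^*(\omega)]\Big| \le 2K\Big\} \ge P\Big\{\Big|\sum_{j=1}^{n} y_j(\omega)\Big| \le K,\ \Big|\sum_{j=1}^{n} y_j^*(\omega)\Big| \le K\Big\} \ge \varepsilon^2, \qquad n = n_\nu.$$

If $\Phi_j$ is the characteristic function of $y_j$ we have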


6 ; À 
Hg 1 oF Se 
x) ii |)? dt = 55 Í af ei dP{lyo)— OET 


C 


= | M arte [y(o) — y*(@)) < 4}. 


26 


-o 


By Helly’s theorem there is a subsequence of the distribution functions 
Of Ya = Yn U U which converges for all A to some bounded 
monotone function G. Then, since the last integrand in the preceding 
equation vanishes at + ©, 


6 oO 
(2.17) 1 fil Oi 
fine! 


f sin 25 
| a Gt). 


-o 


If the series Sa does not converge with probability 1 when centered, 


1 e- 
the integral on the left vanishes, by Theorem 2.7 (i). But then 


r sin Ad 
5 = = Ge ô—> 0), 
0= | T dG(A) = G(œ)— G4- œ) ( ) 


> G(K)— G K) > ®, 


‘and this contradiction implies the truth of the theorem. 




This theorem shows that the distribution of $\sum_1^n y_j$ goes out to infinity if $\sum_1^\infty y_j$ does not converge with probability 1 when centered. In this case, then,

$$\lim_{n \to \infty} P\Big\{\Big|\sum_{j=1}^{n} (y_j(\omega) - c_j)\Big| \le K\Big\} = 0$$

for all choices of centering constants $c_1, c_2, \cdots$ and all K > 0. Another way of writing this is

$$\lim_{n \to \infty}\ \text{L.U.B.}_{-\infty < d < \infty}\ P\Big\{\Big|\Big[\sum_{j=1}^{n} y_j(\omega)\Big] - d\Big| \le K\Big\} = 0.$$

The nth least upper bound in this equation is a function of K which measures the concentration of the distribution of $\sum_1^n y_j$. The study of series $\sum_1^\infty y_j$ of mutually independent random variables has been carried through by Lévy in a way which bases itself on such functions of concentration, and Kawata has based the theory on averages of these functions.


3. The law of large numbers 


Let $y_1, y_2, \cdots$ be random variables. If for some constants $a_1, b_1, a_2, b_2, \cdots$

(3.1) $$\lim_{n \to \infty} \frac{1}{b_n} \sum_{j=1}^{n} (y_j - a_j)$$

exists in some sense of convergence, the sequence $y_1, y_2, \cdots$ is said to be subject to the law of large numbers (relative to the centering constants $a_1, a_2, \cdots$ and scaling constants $b_1, b_2, \cdots$). The law is described as the weak law if the convergence in (3.1) is convergence in probability, the strong law if the convergence in (3.1) is convergence with probability 1. It is of course always possible to choose the $b_n$'s so large that the limit in (3.1) exists with probability 1 and is 0.

In the present section the problem is simplified by the hypothesis that the $y_j$'s are mutually independent. As a first example of the significance of this hypothesis note that, if $b_n \to \infty$ in (3.1) (or even if $\limsup_{n \to \infty} b_n = \infty$) and if the limit x in (3.1) is a limit in probability, then x is a random variable which is unaffected by changes in the values of a finite number of $y_j$'s. Hence, according to the zero-one law (Theorem 1.1), x is identically a constant, with probability 1. This constant can be absorbed in the centering constants $\{a_j\}$ if desired.


§3 THE LAW OF LARGE NUMBERS 123 


THEOREM 3.1 Let $y_1, y_2, \cdots$ be mutually independent random variables with characteristic functions $\Phi_1, \Phi_2, \cdots$ and let $b_1, b_2, \cdots$ be any non-zero constants. There are constants $a_1, a_2, \cdots$, for which

(3.2) $$\text{p}\lim_{n \to \infty} \frac{1}{b_n} \sum_{j=1}^{n} (y_j - a_j) = 0$$

if and only if

(3.3) $$\lim_{n \to \infty} \prod_{j=1}^{n} |\Phi_j(t/b_n)| = 1$$

uniformly in every finite t-interval.
If (3.2) is true, the distributions of the quotients converge to the distribution concentrated at 0; hence

$$\lim_{n \to \infty} \prod_{j=1}^{n} \Phi_j(t/b_n) e^{-ita_j/b_n} = 1$$

uniformly in every finite t-interval, which implies (3.3). Conversely, suppose that (3.3) is true. Choose $c_n$ as a median value of $\sum_1^n y_j/b_n$, and define $a_1, a_2, \cdots$ by

$$a_1 = b_1 c_1, \qquad a_j = b_j c_j - b_{j-1} c_{j-1}, \quad j > 1.$$
Then by I (11.8’), if u > 0, 


=; 


0, (4) dt, 


and the right side goes to 0 with 1/7, by (3.3). Note that the hypothesis 
of uniformity in (3.3) was not used in this sufficiency proof. When 
„ = n we shall use the following fact without special mention in each 


=} = e[l È mo en 


1 2 
P| i 2 lyfe) — a;)] 


<—4L(u, a, a) | log | | 
0 j=1 


n 
application: if lim Sy, exists, then lim y,/n =0. In fact, then 
no M1 n> 


is m gies Ei Sy bsg 
= li = _— = lim —- — =. 
2 din 2y; ea 1 4s noo NT yy n>oN T d 
Lov 
noo M 


If the y;s are random variables, “lim” can be interpreted as “p lim” or 
“li.m.” in this equation. 




In the most common case, $b_n = n$, $a_j = E\{y_j\}$, and the limit in (3.1) is 0. If this is so, and if the limit

(3.4) $$\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} E\{y_j\}$$

exists, we can take $a_j = 0$, and the limit in (3.1) will be that in (3.4). One problem of the theory is to find conditions under which the law of large numbers holds with these constants. For example, suppose that $y_1, y_2, \cdots$ are mutually independent with variances $\sigma_1^2, \sigma_2^2, \cdots$. Then the average $x_n = \frac{1}{n} \sum_1^n y_j$ has expectation and variance

(3.5) $$E\{x_n\} = \frac{1}{n} \sum_{j=1}^{n} E\{y_j\}, \qquad E\{[x_n - E\{x_n\}]^2\} = \frac{1}{n^2} \sum_{j=1}^{n} \sigma_j^2.$$

Hence, if

(3.6) $$\lim_{n \to \infty} \frac{1}{n^2} \sum_{j=1}^{n} \sigma_j^2 = 0,$$

which will be true for example if $\sigma_j \le$ const., $j \ge 1$, the variance of $x_n$ goes to 0, that is,

(3.7) $$\text{l.i.m.}_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} [y_j - E\{y_j\}] = 0.$$

Thus in this case the law of large numbers holds with $a_n = E\{y_n\}$, $b_n = n$ in the sense of convergence in the mean (which implies convergence in probability) and conversely (3.7) implies (3.6). These considerations have barely used the fact of mutual independence of the $y_j$'s and have in fact resulted in no stronger result than the corresponding weak sense Theorem 5.1 of IV, in which the $y_j$'s are only supposed orthogonal. However, far stronger theorems can be proved. We first prove a partial converse, restating the direct part, for completeness.
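The variance formula in (3.5) and the condition (3.6) are simple to evaluate for given variances. The sketch below is illustrative only; the function name and the two example variance sequences (constant variances, and $\sigma_j^2 = j$) are assumptions made here, not examples from the text.

```python
from fractions import Fraction

def variance_of_average(variances):
    """Variance of (y_1 + ... + y_n)/n for uncorrelated y_j with the given
    variances, that is, (1/n^2) * sum of the variances, as in (3.5)."""
    n = len(variances)
    return Fraction(sum(variances), n * n)
```

With all variances equal to 4 the result is 4/n, which tends to 0, so (3.6) holds. With $\sigma_j^2 = j$ the result is $(n+1)/(2n)$, which tends to 1/2, so (3.6) fails and (3.7) does not hold.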


THEOREM 3.2 Let $y_1, y_2, \cdots$ be mutually independent random variables with finite variances $\sigma_1^2, \sigma_2^2, \cdots$. Then (3.7) is true if and only if (3.6) is true. Moreover

(3.8) $$\text{l.i.m.}_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} (y_j - a_j) = 0$$

if and only if (3.6) and

(3.9) $$\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} [E\{y_j\} - a_j] = 0$$

are true. If, for some constants $c_1, c_2, \cdots$, $\limsup_{n \to \infty} c_n/n < \infty$ and $|y_n| \le c_n$, then

(3.8′) $$\text{p}\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} (y_j - a_j) = 0$$

is true if and only if (3.8) is true.
The equivalence of (3.6) and (3.7) has already been noted. The
equivalence of (3.8) and (3.6) combined with (3.9) follows from

    $E\left\{ \left| \frac{1}{n}\sum_{1}^{n} (y_j - a_j) \right|^2 \right\} = \frac{1}{n^2}\sum_{1}^{n} \sigma_j^2 + \left[ \frac{1}{n}\sum_{1}^{n} (E\{y_j\} - a_j) \right]^2.$

If (3.8) is true, the weaker (3.8') is always true. If (3.8') is true and if
$|y_j| \le c_j$, $j \ge 1$, let $\Phi_j$ be the characteristic function of $y_j$ and let $m_j$ be
a median value of $y_j$. Then (3.8') implies by (3.3) that

    $\lim_{n\to\infty} \prod_{j=1}^{n} |\Phi_j(t/n)| = 1$

uniformly in every finite t-interval, and, according to I (11.14) applied
to $(y_j - m_j)/n$,
n 
2 Sop <— 3log | | (00| > 0, 
nT T 


-1 
u< fema] i 
jon 11 


Thus (3.6) is true and consequently (3.7) is also true. Since (3.7) implies
convergence in probability, it can be combined with (3.8') to give

    $\lim_{n\to\infty} \frac{1}{n}\sum_{1}^{n} [E\{y_j\} - a_j] = 0,$

and this equation in turn combines with (3.7) to give the desired (3.8).
THEOREM 3.3 Let $y_1, y_2, \ldots$ be mutually independent random variables
and let c be any positive constant. If

(3.10)    $\lim_{n\to\infty} \sum_{j=1}^{n} P\{|y_j(\omega)| > cn\} = 0,$

and if, when $y_{nj}$ is defined by

    $y_{nj}(\omega) = y_j(\omega)$,    $|y_j(\omega)| \le cn$
    $y_{nj}(\omega) = 0$,    $|y_j(\omega)| > cn$

and has variance $\sigma_{nj}^2$,

(3.11)    $\lim_{n\to\infty} \frac{1}{n^2}\sum_{1}^{n} \sigma_{nj}^2 = 0,$

then

(3.12)    $\operatorname{p\,lim}_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} (y_j - a_j) = 0$

for some constants $a_1, a_2, \ldots$ satisfying

(3.13)    $\limsup_{n\to\infty} \frac{|a_n|}{n} \le c.$

Conversely, if

(3.12')    $\operatorname{p\,lim}_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} y_j = 0,$

then (3.10) and (3.11) are also true, for every c > 0.
To prove the direct half, note that

    $P\{y_{nj}(\omega) = y_j(\omega),\ j \le n\} \ge 1 - \sum_{j=1}^{n} P\{|y_j(\omega)| > cn\},$

so that by (3.10)

(3.14)    $\operatorname{p\,lim}_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} (y_j - y_{nj}) = 0.$

Now (Theorem 3.2) (3.11) implies

    $\operatorname{p\,lim}_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} (y_{nj} - E\{y_{nj}\}) = 0$

so that, combining the last two equations,

    $\operatorname{p\,lim}_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} (y_j - E\{y_{nj}\}) = 0,$

which implies (3.12) for properly chosen $a_j$'s. This equation in turn
implies

    $\operatorname{p\,lim}_{n\to\infty} \frac{y_n - a_n}{n} = 0,$

whereas, by (3.10), $P\{|y_n(\omega)| > cn\} \to 0$. Hence (3.13) is true.
Conversely, suppose that (3.12') is true. Then, if $\Phi_j$ is the characteristic
function of $y_j$,

(3.15)    $\lim_{n\to\infty} \prod_{1}^{n} \Phi_j(t/n) = 1$

uniformly in every finite interval. Let $a_j'$ be a median value of $y_j$.
Then, since by (3.12') $\operatorname{p\,lim}_{n\to\infty} y_n/n = 0$, it follows that

    $\lim_{n\to\infty} \frac{a_n'}{n} = 0.$


By I (11.8), and by (3.15) 


poi > n) <— 4L,(u, u, H) J tog] ] |®,(z/n)| dt > 0, 


Ze 
1 
and this combined with the preceding equation yields (3.10) if u < 6. 


Now, if $y_{nj}$ is defined as in the statement of the theorem, we have seen
above that (3.10) implies (3.14), and (3.12') together with (3.14) imply

    $\operatorname{p\,lim}_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} y_{nj} = 0.$

The reasoning which in the proof of Theorem 3.2 led from convergence 
in probability (3.8’) to convergence in the mean (3.8) is applicable without 
change here, leading to (3.11). 

If the condition (3.6) is strengthened slightly, the conclusion of
convergence in the mean and in probability can be strengthened to convergence
with probability 1. To simplify the notation we take $E\{y_j\} = 0$.

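THEOREM 3.4 Let $y_1, y_2, \ldots$ be mutually independent random variables
with

    $E\{y_j\} = 0$,    $E\{y_j^2\} = \sigma_j^2 < \infty$.

Then, if $\sum_{1}^{\infty} \sigma_j^2/j^2 < \infty$,

(3.16)    $\operatorname{l.i.m.}_{n\to\infty} \frac{1}{n}\sum_{1}^{n} y_j = \lim_{n\to\infty} \frac{1}{n}\sum_{1}^{n} y_j = 0$

with probability 1.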
According to Theorem 2.3 the present hypotheses imply that the series

    $\sum_{1}^{\infty} \frac{y_j}{j}$

is convergent in the mean and with probability 1, say to S. Now, if $S_n = \sum_{1}^{n} y_j/j$,

    $\frac{1}{n}\sum_{1}^{n} y_j = -\frac{1}{n}\sum_{j=1}^{n-1} S_j + S_n.$

When $n \to \infty$ the last term on the right converges to S both in the mean
and with probability 1. The same is true of the average on the right 
since convergence in the mean and ordinary convergence imply the same 
type of convergence to the same limit when averages are taken (Cesàro 
summability). Hence the limit on the left exists both in the mean and 
with probability 1, and is 0. 
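The summation-by-parts identity used in this proof can be checked numerically. The sketch below is only an illustration; the function name `abel_sides` and the sample values are ours. It computes $\frac{1}{n}\sum_1^n y_j$ and $S_n - \frac{1}{n}\sum_1^{n-1} S_j$ for an arbitrary finite sequence.

```python
def abel_sides(y):
    """Return (lhs, rhs) of (1/n) sum y_j = S_n - (1/n) sum_{j<n} S_j, with S_j = sum_{k<=j} y_k / k."""
    n = len(y)
    partial_sums = []            # partial_sums[j-1] holds S_j
    running = 0.0
    for j, yj in enumerate(y, start=1):
        running += yj / j
        partial_sums.append(running)
    lhs = sum(y) / n
    rhs = partial_sums[-1] - sum(partial_sums[:-1]) / n
    return lhs, rhs


if __name__ == "__main__":
    # Arbitrary sample values chosen for illustration.
    print(abel_sides([0.3, -1.2, 2.5, 0.7, -0.4]))
```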

Although this theorem has made use of the hypothesis of mutual
independence of the $y_j$'s, the weak sense version (IV, Theorem 5.2), in
which the only qualitative hypothesis is that the $y_j$'s are mutually
orthogonal, has only slightly stronger conditions on the sequence
$E\{y_1^2\}, E\{y_2^2\}, \ldots$.

It is sufficient for the convergence of the critical series if $\sigma_n^2 \le$ const.,
$n \ge 1$, and the classical examples of the law of large numbers are special
cases of this. For example, suppose that

    $P\{y_j(\omega) = 1\} = p_j$,    $P\{y_j(\omega) = 0\} = q_j = 1 - p_j$.

Then

    $E\{y_j\} = p_j$,    $\sigma_j^2 = p_j q_j \le \tfrac{1}{4}$,

so that, by Theorem 3.4,

    $\lim_{n\to\infty} \frac{1}{n}\sum_{1}^{n} [y_j - p_j] = 0$

with probability 1. If the $p_j$'s are all equal, $p_j = p$ (Bernoulli), we have

    $\lim_{n\to\infty} \frac{1}{n}\sum_{1}^{n} y_j = p;$

if $\lim_{n\to\infty} \frac{1}{n}\sum_{1}^{n} p_j = p$ exists (Poisson), the same equation holds, with the
new interpretation of p. The left side is in the usual language the “success
ratio,” the number of successful trials out of n divided by n, and the
theorem in the Bernoulli case states that the success ratio approaches with
probability 1 the probability of success in each trial.
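A small simulation illustrates the Bernoulli and Poisson cases. It is a sketch under stated assumptions: the probabilities 0.3, 0.2 and 0.6, the sample size, and the seed are arbitrary choices of ours, and a finite simulation can only suggest, not verify, convergence with probability 1.

```python
import random


def success_ratio(ps, rng):
    """Fraction of successes in independent trials with success probabilities ps."""
    hits = sum(1 for p in ps if rng.random() < p)
    return hits / len(ps)


if __name__ == "__main__":
    rng = random.Random(0)       # fixed seed, for reproducibility only
    n = 100_000
    # Bernoulli case: all p_j equal to p = 0.3 (assumed value).
    print(success_ratio([0.3] * n, rng))
    # Poisson case: p_j alternates between 0.2 and 0.6, so (1/n) sum p_j -> 0.4.
    print(success_ratio([0.2 if j % 2 else 0.6 for j in range(n)], rng))
```

In both cases the printed ratio should lie close to the limiting value p (0.3 and 0.4 respectively), with fluctuations of order $n^{-1/2}$.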


4. Infinitely divisible distributions and the central limit theorem 


Let x be a random variable, and suppose that it can be expressed in
the form

(4.1)    $x = y_1 + \cdots + y_n$

where $y_1, \ldots, y_n$ are mutually independent. This is no restriction on
the distribution of x, since we can take $y_1 = x$, $y_2 = \cdots = y_n = 0$. If
it is supposed in addition that the $y_j$'s are small, however, (4.1) is a real
restriction on x. To avoid discussing the somewhat extraneous issue of
the complexity of the relevant $\omega$ spaces we shall formulate conditions in
terms of the distributions or their characteristic functions rather than in 
terms of corresponding random variables. 

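A distribution function F is called infinitely divisible in the generalized
sense if, for each $\eta > 0$, F can be written as a convolution of distribution
functions $F_1, \ldots, F_n$,

(4.1')    $F(\lambda) = \int_{-\infty}^{\infty} dF_1(\lambda_1) \int_{-\infty}^{\infty} dF_2(\lambda_2) \cdots \int_{-\infty}^{\infty} F_n(\lambda - \lambda_1 - \cdots - \lambda_{n-1})\, dF_{n-1}(\lambda_{n-1})$

with $1 - F_j(\eta) + F_j(-\eta-) \le \eta$, $j \le n$.
If $y_1, \ldots, y_n$ are mutually independent random variables with distribution
functions $F_1, \ldots, F_n$, their sum x will then have distribution function F
and $P\{|y_j(\omega)| > \eta\} \le \eta$, $j \le n$. Note that both n and the $F_j$'s will
depend on $\eta$ except in trivial special cases.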

Evidently $\Phi$ is the characteristic function of an infinitely divisible law
(generalized sense) if and only if, for each $\varepsilon > 0$, $\Phi$ can be written in the
form

(4.1'')    $\Phi = \prod_{j=1}^{n} \Phi_j,$

where $\Phi_1, \ldots, \Phi_n$ are characteristic functions with $|1 - \Phi_j(t)| < \varepsilon$ for
$|t| \le 1/\varepsilon$; n and the $\Phi_j$'s will depend on $\varepsilon$ except in trivial special
cases.

A distribution function F is called infinitely divisible if for each n it can
be written in the form (4.1') with $F_1 = \cdots = F_n$, that is, if for each n
its characteristic function $\Phi$ is the nth power of a characteristic function,
$\Phi = \Psi_n^n$. If $\delta > 0$ is chosen so small that $\Phi(t) \ne 0$ for $|t| \le \delta$, it is
clear that $\lim_{n\to\infty} \Psi_n(t) = 1$ uniformly for $|t| \le \delta$, and (cf. I §11) this
implies that $\lim_{n\to\infty} \Psi_n(t) = 1$ uniformly in every finite t-interval. Thus $\Phi$
must be the characteristic function of a distribution that is infinitely
divisible in the generalized sense. It will be shown below that conversely,
if $\Phi$ is infinitely divisible in the generalized sense, it is also infinitely
divisible. In other words, it will be shown that the added hypothesis
that the $F_j$'s in (4.1') or the $\Phi_j$'s in (4.1'') are identical is no further
restriction on the given distribution; the phrase “in the generalized sense” will
then be superfluous and will be omitted thereafter.

Before proceeding we give some simple examples of infinitely divisible 
distributions, obtaining the decomposition (4.1’) or (4.1”) in each case. 




(a) Let the distribution be concentrated at a single point $\gamma$, so that

(4.2)    $F(\lambda) = 0$, $\lambda < \gamma$;    $F(\lambda) = 1$, $\lambda \ge \gamma$;    $\Phi(t) = e^{i\gamma t}$.

Then, for every n, F is the convolution of n distributions each concentrated
at $\gamma/n$, $\Phi(t) = (e^{i\gamma t/n})^n$.

(b) Let the distribution be Gaussian, with expectation $\gamma$ and variance
$\sigma^2$. For each n this is the convolution of n Gaussian distributions, with
expectation $\gamma/n$ and variance $\sigma^2/n$;

    $\Phi(t) = \left[ e^{i\gamma t/n - \sigma^2 t^2/(2n)} \right]^n.$

(c) Let the distribution be a Poisson distribution:

(4.3)    $P\{x(\omega) = n\} = e^{-c}\, \frac{c^n}{n!}$,    $n = 0, 1, \ldots$,    $c > 0$.

The characteristic function $\Phi$ of this distribution is given by

(4.4)    $\log \Phi(t) = c(e^{it} - 1)$

and $\Phi^{1/n}$ is then the characteristic function of the same distribution except
that c is replaced by c/n. More generally, if x has the distribution (4.3),
the characteristic function of ax + b has logarithm

(4.5)    $itb + c(e^{ita} - 1)$

and the (properly chosen) nth root of the characteristic function is the
characteristic function of the same distribution with a, b, c replaced by
a, b/n, c/n.
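Example (c) can be checked numerically. The following sketch compares the closed form (4.4) with the characteristic function computed directly from the probabilities (4.3), and verifies that the nth power of the Poisson characteristic function with parameter c/n reproduces that with parameter c. The function names, the series truncation at 200 terms, and the sample values of t, c and n are assumptions made for illustration.

```python
import cmath
import math


def poisson_cf_direct(t, c, terms=200):
    """Sum of e^{-c} c^n / n! * e^{itn}, truncated after `terms` terms."""
    total = 0j
    p = math.exp(-c)                 # P{x = 0}
    for n in range(terms):
        total += p * cmath.exp(1j * t * n)
        p *= c / (n + 1)             # P{x = n + 1} from P{x = n}
    return total


def poisson_cf_closed(t, c):
    """Closed form exp(c (e^{it} - 1)) from (4.4)."""
    return cmath.exp(c * (cmath.exp(1j * t) - 1))


if __name__ == "__main__":
    t, c, n = 0.7, 2.5, 5
    print(abs(poisson_cf_direct(t, c) - poisson_cf_closed(t, c)))
    print(abs(poisson_cf_closed(t, c / n) ** n - poisson_cf_closed(t, c)))
```

Both printed differences should be at the level of floating-point rounding error.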

We shall show below that every infinitely divisible distribution in the
generalized sense has characteristic function $\Phi$ given by Lévy's formula,
stated at this point in Khintchine's form,

(4.6)    $\log \Phi(t) = i\gamma t + \int_{-\infty}^{\infty} \left( e^{it\lambda} - 1 - \frac{it\lambda}{1+\lambda^2} \right) \frac{1+\lambda^2}{\lambda^2}\, dG(\lambda),$

where G is monotone non-decreasing and bounded, and the integrand is
defined as its limiting value $-t^2/2$ when $\lambda = 0$. Before proving this we
shall examine the functions $\Phi$ given by (4.6) in various special cases:


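(a') $G(\lambda) \equiv 0$.

Then $\Phi(t) = e^{i\gamma t}$, and $\Phi$ is the characteristic function of the infinitely
divisible distribution discussed under (a) above.

(b') $G(\lambda) = 0$, $\lambda < 0$;
     $G(\lambda) = \sigma^2$, $\lambda \ge 0$.

Then

    $\log \Phi(t) = i\gamma t - \frac{\sigma^2}{2} t^2$

and $\Phi$ is the characteristic function of the Gaussian distribution with
mean $\gamma$ and variance $\sigma^2$ discussed under (b) above.

(c') $G(\lambda) = 0$, $\lambda < \lambda_0$;
     $G(\lambda) = c_0$, $\lambda \ge \lambda_0$, where $c_0 > 0$, $\lambda_0 \ne 0$.

Then

    $\log \Phi(t) = i\gamma t + \left( e^{it\lambda_0} - 1 - \frac{it\lambda_0}{1+\lambda_0^2} \right) \frac{1+\lambda_0^2}{\lambda_0^2}\, c_0$
    $\qquad\quad\;\, = i\left( \gamma - \frac{c_0}{\lambda_0} \right) t + c_0\, \frac{1+\lambda_0^2}{\lambda_0^2}\, (e^{it\lambda_0} - 1),$

which is the Poisson case discussed under (c) above. Note that $\lambda_0$, the $\lambda$
at the jump of G, is the magnitude of the increment in the Poisson
distribution.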
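The two facts used in these special cases, namely that the integrand in (4.6) tends to $-t^2/2$ as $\lambda \to 0$ and that a single jump of G reproduces the Poisson form (4.5), can be checked numerically. The sketch below is illustrative only; the function names and the sample parameter values are ours.

```python
import cmath


def integrand(t, lam):
    """Integrand of (4.6); defined as its limiting value -t^2/2 at lam = 0."""
    if lam == 0.0:
        return complex(-t * t / 2)
    return ((cmath.exp(1j * t * lam) - 1 - 1j * t * lam / (1 + lam * lam))
            * (1 + lam * lam) / (lam * lam))


def log_cf_single_jump(t, gamma, c0, lam0):
    """log Phi(t) from (4.6) when G has a single jump c0 at lam0."""
    return 1j * gamma * t + c0 * integrand(t, lam0)


def log_cf_poisson_form(t, gamma, c0, lam0):
    """The same quantity written in the Poisson form (4.5)."""
    c = c0 * (1 + lam0 * lam0) / (lam0 * lam0)
    return 1j * (gamma - c0 / lam0) * t + c * (cmath.exp(1j * t * lam0) - 1)


if __name__ == "__main__":
    print(integrand(1.3, 1e-4), -1.3 ** 2 / 2)
    print(abs(log_cf_single_jump(0.9, 0.4, 1.5, 2.0) - log_cf_poisson_form(0.9, 0.4, 1.5, 2.0)))
```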

Now, since the integrand is bounded and continuous, the integral in
(4.6) always exists. Moreover, if $-N = \lambda_0 < \cdots < \lambda_n = N$ and if
$|t| \le T$,

    $\left| \log \Phi(t) - \left\{ i\gamma t + \sum_{1}^{n} \left[ e^{it\lambda_j} - 1 - \frac{it\lambda_j}{1+\lambda_j^2} \right] \frac{1+\lambda_j^2}{\lambda_j^2}\, [G(\lambda_j) - G(\lambda_{j-1})] \right\} \right|$
    $\qquad \le K[G(\infty) - G(N) + G(-N) - G(-\infty)] + \eta[G(N) - G(-N)],$

where K is the L.U.B. of the absolute value of the integrand in (4.6) for
all $\lambda$, $|t| \le T$, and $\eta$ is the maximum oscillation of the integrand in the
n intervals $(\lambda_0, \lambda_1), \ldots, (\lambda_{n-1}, \lambda_n)$ for $|t| \le T$. The right-hand side can
be made arbitrarily small for fixed T by choosing N large and then
choosing $\max_j (\lambda_j - \lambda_{j-1})$ small. Hence the function in (4.6) can be
expressed as the limit of a sequence of functions obtained by replacing
the integral by properly chosen Riemann-Stieltjes sums, and the limit is
uniform in every finite t-interval. Since the expression with sums is the
logarithm of a characteristic function (it is the logarithm of the characteristic
function of a convolution of Gaussian and Poisson distributions discussed
under (b) and (c)), $\log \Phi$ as defined by (4.6) must also be the logarithm of a
characteristic function. The corresponding distribution must be infinitely
divisible since $(1/n) \log \Phi$ is given by the same formula with $\gamma$, G replaced
by $\gamma/n$, $G/n$.

THEOREM 4.1 An infinitely divisible distribution is infinitely divisible in
the generalized sense and conversely. A distribution is infinitely divisible
if and only if its characteristic function $\Phi$ never vanishes and is given by
(4.6), where G is monotone non-decreasing and bounded, and $\gamma$ is real.

It will be sufficient to prove that the characteristic function of an
infinitely divisible distribution in the generalized sense can be written in
the form (4.6). In fact, we have already remarked that this form always
gives infinitely divisible distributions. By hypothesis, for every $\varepsilon > 0$ we
can write $\Phi$ in the form

    $\Phi = \prod_j \Phi_{\varepsilon j},$

where $\Phi_{\varepsilon j}$ is a characteristic function, and

    $|1 - \Phi_{\varepsilon j}(t)| < \varepsilon$ for $|t| \le 1/\varepsilon$.

The minimum number of factors n depends on $\varepsilon$. Then, if $\varepsilon < 1$,
$\Phi(t) \ne 0$ for $|t| \le 1/\varepsilon$, so that $\Phi(t)$ never vanishes. We center the
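distribution corresponding to $\Phi_{\varepsilon j}$ as we have done repeatedly, by means
of a truncated expectation; if $F_{\varepsilon j}$ is the distribution function we subtract
the centering constant

    $m_{\varepsilon j}' = \int_{m_{\varepsilon j}-\alpha-}^{m_{\varepsilon j}+\alpha} \lambda\, dF_{\varepsilon j}(\lambda) + m_{\varepsilon j}\,[1 - F_{\varepsilon j}(m_{\varepsilon j} + \alpha) + F_{\varepsilon j}(m_{\varepsilon j} - \alpha-)],$

where $m_{\varepsilon j}$ is a median value of the $F_{\varepsilon j}$ distribution, getting the new
distribution function $\hat F_{\varepsilon j}$, with $\hat F_{\varepsilon j}(\lambda) = F_{\varepsilon j}(\lambda + m_{\varepsilon j}')$. Then if $\hat\Phi_{\varepsilon j}$ is the
characteristic function of the centered distribution we can write

    $\Phi(t) = e^{i\gamma_\varepsilon t} \prod_j \hat\Phi_{\varepsilon j}(t),$

where $\gamma_\varepsilon$ is the sum of the centering constants. The positive constant $\alpha$
will be fixed throughout the following discussion. Then it is clear that
as $\varepsilon \to 0$ the individual centering constants go to 0 uniformly so that,
changing the notation if necessary, we can suppose that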


l= =i 




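We shall use o, O notation referring to $\varepsilon \to 0$ with t restricted to a finite
interval, and there will always be uniformity in t with this restriction.
Now

    $\log \Phi(t) = it\gamma_\varepsilon + \sum_j \log \hat\Phi_{\varepsilon j}(t)$
    $\qquad\quad\;\, = it\gamma_\varepsilon + \sum_j [\hat\Phi_{\varepsilon j}(t) - 1] + O\left( \sum_j |\hat\Phi_{\varepsilon j}(t) - 1|^2 \right)$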

and by I (11.10), if e < 1/T, 
ZO- IP se rI6,0—1| 
d a 


rs 
<—2eL,(T, % 1, 1) f log aolas, <T 
0 


=a 
so that

    $\log \Phi(t) = it\gamma_\varepsilon + \sum_j [\hat\Phi_{\varepsilon j}(t) - 1] + o(1).$


Now by I (11.8) and (11.9), using our evaluation of Ly, and the approxi- 
mation of log ® just obtained, if f, = X hy 
j 
i/u 


an F(o)— flu- + ÊC DSL (u 3 +) f ERI- Oo) at 


0 9 


1u 


< (1+ 2r)u J > HLL — ÈO] at 
a Í 


1/u 
=— (1+ 27)? | log|®(| dt + oC) 
0 


and 
u 


1 
(4.8) | Bahay < Lu, 1,1) f ZR OON 4 
o7 


=i 


1 
Zope 1 f log |®(¢)| dt + o(1); 


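here o(1) is uniform in t, u, for $|t| \le T$, $1/u \le T$, for any fixed T, in the
first inequality, and for $|t| \le T$, fixed u, in the second inequality. Then
if $G_\varepsilon$ is defined by

    $G_\varepsilon(\lambda) = \int_{-\infty}^{\lambda} \frac{\mu^2}{1+\mu^2}\, d\hat F_\varepsilon(\mu)$,    $\hat F_\varepsilon = \sum_j \hat F_{\varepsilon j}$,

$G_\varepsilon$ is monotone in $\lambda$, uniformly bounded when $\varepsilon \to 0$, according to (4.7)
and (4.8), and, according to (4.7),

    $\lim_{\lambda\to\infty} G_\varepsilon(\lambda) = G_\varepsilon(+\infty)$

uniformly in $\varepsilon$ when $\varepsilon \to 0$. By Helly's theorem a decreasing sequence
of values $\varepsilon_1, \varepsilon_2, \ldots$ of $\varepsilon$ can be found along which $G_\varepsilon(\lambda) \to G(\lambda)$ for all
$\lambda$, where G is monotone and bounded; and $G(-\infty) = 0$, $G(\infty) = \lim_{\varepsilon\to 0} G_\varepsilon(\infty)$
according to the remark on uniformity just made. Now,
replacing $\hat F_\varepsilon$ by $G_\varepsilon$, $\log \Phi$ can be put in the form

(4.9)    $\log \Phi(t) = it\gamma_\varepsilon + \sum_j [\hat\Phi_{\varepsilon j}(t) - 1] + o(1)$
    $\qquad\quad\;\, = it\gamma_\varepsilon + \int_{-\infty}^{\infty} (e^{it\lambda} - 1)\, d\hat F_\varepsilon(\lambda) + o(1)$
    $\qquad\quad\;\, = it\gamma_\varepsilon' + \int_{-\infty}^{\infty} \left( e^{it\lambda} - 1 - \frac{it\lambda}{1+\lambda^2} \right) \frac{1+\lambda^2}{\lambda^2}\, dG_\varepsilon(\lambda) + o(1)$

where

    $\gamma_\varepsilon' = \gamma_\varepsilon + \int_{-\infty}^{\infty} \frac{dG_\varepsilon(\lambda)}{\lambda}.$

When $\varepsilon \to 0$ along the sequence $\{\varepsilon_j\}$, the integral in the last line of (4.9)
goes to that in (4.6) uniformly in every finite t-interval (cf. the discussion
of Stieltjes integration to the limit in I §11), and, since the left side of
(4.9) does not depend on $\varepsilon$, $\gamma_\varepsilon'$ must converge to some limit $\gamma$. This
finishes the proof.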

It will be useful below to know that $\Phi$ in (4.6) determines $\gamma$ and G
uniquely (assuming, say, that G is normalized by supposing it continuous
on the right and 0 at $-\infty$). To show this note that

(4.10)    $\log \Phi(t) - \frac{1}{2}\int_{t-1}^{t+1} \log \Phi(s)\, ds = \int_{-\infty}^{\infty} e^{it\lambda} \left( 1 - \frac{\sin \lambda}{\lambda} \right) \frac{1+\lambda^2}{\lambda^2}\, dG(\lambda)$
    $\qquad\qquad = \int_{-\infty}^{\infty} e^{it\lambda}\, dH(\lambda),$

where H is monotone non-decreasing and bounded. Thus H is the
“distribution” function corresponding to the “characteristic” function
on the left side of (4.10), which depends only on $\log \Phi$, and is, together
with G, therefore determined by $\Phi$. The constant $\gamma$
is then determined uniquely as the difference between $\log \Phi$ and the
integral in (4.6).

Note that in the derivation of (4.6) there appears to be an ambiguity in
the definition of G since it was defined as the limit of some sequence of
functions $\{G_{\varepsilon_j}\}$, and at that stage it was conceivable that a different
sequence would give a different limit. Since G has now been shown to
be uniquely determined at all its continuity points by $\Phi$, if $G(-\infty) = 0$,
it follows that G is the only possible limit, that is, $\lim_{\varepsilon\to 0} G_\varepsilon(\lambda) = G(\lambda)$ at
all continuity points of G; similarly $\lim_{\varepsilon\to 0} \gamma_\varepsilon' = \gamma$.

In the most important particular case G is constant except for a jump
when $\lambda = 0$, so that the given infinitely divisible distribution is Gaussian.
The following corollary gives a sufficient condition for this.

COROLLARY 1 If for every $\eta > 0$ a distribution function F can be
written in the form (4.1') with

    $\sum_j [1 - F_j(\eta) + F_j(-\eta-)] \le \eta,$

then the distribution function is Gaussian.


The condition on F is much stronger than that of infinite divisibility. 
To prove the corollary we follow through the proof of the theorem, using 
the added hypothesis of the corollary. Then the F; of the corollary 
corresponds to F,; in the proof of the theorem, that is, in the proof of the 
theorem we can suppose that, for any 7) > 0, 


SU Fy) + Fo WIS % 
J 
if e is sufficiently small. If 7 <4, |m,| <7 and [moy] < (2 + &)- 
Then the condition on the F,;’s implies the condition 
1— f(r’ -) + A n) = ZN Aya) + F(a 
3 
n = n8 + 2), 


on $\hat F_\varepsilon$. Thus the $\hat F_\varepsilon$ “distribution” becomes concentrated at the origin.
The same is then true of the $G_\varepsilon$ “distribution,” so that G must be constant
except for a possible jump at the origin. Then the original distribution
is Gaussian. (We identify a random variable that is identically constant
with a Gaussian variable having variance 0.) This corollary is not
vacuous, since, if x is a Gaussian random variable with

    $E\{x\} = 0$,    $E\{x^2\} = \sigma^2 > 0$,

and if we write x in the form $x = \sum_{1}^{n} y_j$, where $y_1, \ldots, y_n$ are mutually
independent Gaussian random variables with

    $E\{y_j\} = 0$,    $E\{y_j^2\} = \sigma^2/n$,

then

    $\sum_j P\{|y_j(\omega)| \ge \eta\} = n P\{|y_1(\omega)| \ge \eta\} = \frac{2n}{\sigma\sqrt{2\pi}} \int_{\eta\sqrt{n}}^{\infty} e^{-\lambda^2/2\sigma^2}\, d\lambda \to 0$    $(n \to \infty)$.
In terms of random variables this corollary states that, if a random
variable x can be represented as the sum of mutually independent random
variables, $x = \sum_{1}^{n} y_j$, with $P\{\max_j |y_j(\omega)| > \eta\} \le \eta$ for arbitrarily small $\eta$,
then x is Gaussian. In fact the conditions

    $\sum_j P\{|y_j(\omega)| > \eta\} \le \eta$,    $P\{\max_j |y_j(\omega)| > \eta\} \le \eta$

are equivalent in accordance with the fact (see §1) that the probability of
at least one of a number of independent events (in this case $|y_j(\omega)| > \eta$)
happening goes to zero together with the expected number happening.
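For the Gaussian decomposition just described, both quantities can be evaluated in closed form, which gives a quick check that the corollary is not vacuous. The sketch below is an illustration under our own choice of $\eta$, $\sigma$ and n; it uses the identity $P\{|y_1(\omega)| \ge \eta\} = \operatorname{erfc}(\eta\sqrt{n}/(\sigma\sqrt{2}))$ for $y_1$ Gaussian with mean 0 and variance $\sigma^2/n$.

```python
import math


def exceedance_summary(n, eta, sigma=1.0):
    """Return (sum_j P{|y_j| >= eta}, P{max_j |y_j| >= eta}) for n i.i.d. N(0, sigma^2/n) summands."""
    p = math.erfc(eta * math.sqrt(n) / (sigma * math.sqrt(2.0)))
    expected_number = n * p                  # expected number of large summands
    prob_at_least_one = 1.0 - (1.0 - p) ** n # uses independence of the summands
    return expected_number, prob_at_least_one


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # Both entries go to 0 as n grows, and the second never exceeds the first.
        print(n, exceedance_summary(n, eta=0.5))
```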

In the above discussion we took a given distribution and analyzed the
restriction imposed on it by the hypothesis that it was a convolution of
distributions concentrated near the origin. A slightly more general
question is the following. Suppose that $y_1, \ldots, y_n$ are mutually
independent and small,

    $P\{|y_j(\omega)| \ge \varepsilon\} < \varepsilon$,    $j = 1, \ldots, n$.

Then what can one say of the asymptotic character of the distribution of
their sum x? If $\Phi_{\varepsilon j}$ is the characteristic function of the $y_j$ distribution
and if $\Phi_\varepsilon$ is that of the x distribution,

    $\Phi_\varepsilon = \prod_j \Phi_{\varepsilon j}.$

The difference between this question and the one treated already lies in
the fact that the left side depends on the $y_j$'s. We can use the notation
and methods of the proof of Theorem 4.1, however. This proof shows
that, if when $\varepsilon \to 0$ the $y_j$'s are chosen so that x has a limiting distribution,
then this limiting distribution must be given by (4.6), and thus be infinitely
divisible. In particular, if a given distribution can be approximated by
infinitely divisible distributions it can be approximated by convolutions
of small $y_j$'s, so that it must be infinitely divisible, that is, we have the
following corollary:




COROLLARY 2 Any limiting distribution of infinitely divisible distributions
is infinitely divisible.

It is easy to show, using (4.10), that the G's of the approximating
distributions, cf. (4.6), converge to that of the limit distribution, at
the points of continuity of the latter, if all are normalized to be 0
at $-\infty$.

The most natural theorem from this point of view is that the distribution
of the sum of a large number of small mutually independent random
variables is near some infinitely divisible distribution whose $\gamma$ and G in
(4.6) are expressible in terms of the summands. In proving such a theorem
one can proceed as in the proof of Theorem 4.1, but there is an additional
difficulty. In Theorem 4.1 the distribution of the sum was given. Thus
in inequalities (4.7) and (4.8), involving on the right the characteristic
function of this distribution, the right sides are fixed. But from the
present point of view the $\Phi$ in those inequalities is not fixed, and in fact
it is to be proved that this $\Phi$ has some asymptotic character when $\varepsilon$ is
small. Thus the hypotheses of the theorem must themselves limit the
left sides of (4.7) and (4.8). We shall restrict ourselves to the case of a
Gaussian limiting distribution, that is, to the case of a G in (4.6) which
is constant except for a jump at $\lambda = 0$. We shall thus be proving one
form of the central limit theorem, a generic name applied to any theorem
which states that the sum of small random variables, under appropriate
restrictions, is nearly normally distributed.

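THEOREM 4.2 Let $x = y_1 + \cdots + y_n$ be a sum of mutually independent
random variables. Let $\eta$, $\alpha$, b be positive numbers, with $1/b \le \alpha \le b$, and
define

    $\gamma_j = \int_{|\lambda| \le \alpha} \lambda\, dP\{y_j(\omega) \le \lambda\}$,    $\gamma = \sum_j \gamma_j$,

    $\sigma_j^2 = \int_{|\lambda| \le \alpha} \lambda^2\, dP\{y_j(\omega) \le \lambda\} - \gamma_j^2$,    $\sigma^2 = \sum_j \sigma_j^2$.

Suppose that $\sigma^2 \le b$ and that

(4.11)    $\sum_j P\{|y_j(\omega)| > \eta\} \le \eta.$

Then there is an $\eta_1$ depending only on $\eta$, b, and going to 0 with $\eta$ for fixed
b, such that

(4.12)    $\left| P\{x(\omega) - \gamma \le \lambda\} - \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\lambda} e^{-\mu^2/2\sigma^2}\, d\mu \right| \le \eta_1$,    $-\infty < \lambda < \infty$.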




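Conversely, if $x = y_1 + \cdots + y_n$ is a sum of mutually independent
random variables with zero medians, if $\eta$, $\alpha$, b are positive numbers with
$1/b \le \alpha \le b$, if $\gamma$, $\sigma$ are defined as above, if

(4.13)    $P\{|y_j(\omega)| > \eta\} \le \eta$,    $j \le n$,

and if

(4.12')    $\left| P\{x(\omega) - \bar\gamma \le \lambda\} - \frac{1}{\bar\sigma\sqrt{2\pi}} \int_{-\infty}^{\lambda} e^{-\mu^2/2\bar\sigma^2}\, d\mu \right| \le \eta$,    $-\infty < \lambda < \infty$,

with $|\bar\gamma| \le b$, $\bar\sigma^2 \le b$, then there is an $\eta_1$ depending only on b, $\eta$, such that
$\eta_1$ goes to 0 with $\eta$ for fixed b, and that

(4.14)    $\sum_j P\{|y_j(\omega)| \ge \eta_1\} \le \eta_1$,    $|\sigma^2 - \bar\sigma^2| \le \eta_1$,    $|\gamma - \bar\gamma| \le \eta_1$.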
J 

The condition (4.11) can be replaced by the asymptotically equivalent
condition

(4.11')    $P\{\max_{j \le n} |y_j(\omega)| > \eta\} \le \eta.$

To show the real simplicity of the direct half of this theorem, we prove
it without recourse to the methods of Theorem 4.1. Define $y_j'$ by

    $y_j'(\omega) = y_j(\omega)$,    $|y_j(\omega)| \le \alpha$
    $y_j'(\omega) = 0$,    $|y_j(\omega)| > \alpha$,

so that

    $\gamma_j = E\{y_j'\}$,    $\sigma_j^2 = E\{(y_j' - \gamma_j)^2\}$.
From now on we shall assume that 7 is so small that 
1/b a b 


GR 120+ 1/6) ~ 30 + @) S20 Fy 
Then 


rs] <n + Pilykr)| > n} < nfl + b), 
oP < Ely) < n + 402P{\y,(o)| > n} (1 + 46%), 
Ejly/ — y} < 12 + Do? + 8a P{ly,/(w) — y| > (2 + D} 
< 12 + bo? + 88P{\y,/(o)| > 7}, 


so that 
3 E{ly,/ ~ p|} n2 + b)? + 808 < ml(2 + bye + 86°] = bin. 




Now according to I (11.4) with n = 3, if , is the characteristic function 
of yj — Yp 
o?t® y He 
osi- tu aew r EE 
If T> 0, 


Lee lee oy nee 
3 i 7 S4, 


Max 
i 
for sufficiently small 7, and we can write 
oft 
log 0,(1) = ——- es, ripe Os Teed | 
where 
oft 2 
KES (a ah u) : 
Then if Ê is the characteristic function of > (y; — 7,), 
eS h 
(4.16) log b(t) = — T +> (u + v), \t|<7, 
J 
and 
Š oft off 
uy + ol < F bl + Max (6 +u) (22 + l) 
j J j \2 y\ 2 
Y bT? i u rO. W Ara (= 7 aa 


FFG 2 6 
l| <7, 


so that the sum in (4.16) goes to 0 with $\eta$ uniformly in t for $|t| \le T$.
Thus $\sum_j (y_j' - \gamma_j)$ is asymptotically normal with mean 0 and variance $\sigma^2$,
as $\eta \to 0$. Since

    $P\{y_j(\omega) = y_j'(\omega),\ j \le n\} \ge 1 - \eta,$

the distribution of $\sum_j (y_j - \gamma_j)$ is asymptotically ($\eta \to 0$) the same as that
of $\sum_j (y_j' - \gamma_j)$, and this finishes the proof of the direct half of Theorem 4.2.


Suppose now that the hypotheses of the converse are true. We then
assume that $\eta \to 0$ along a sequence of values, that for each $\eta$ value there
is a sum x as described, that $\eta_1$ is given, and we prove that (4.14) is true
for $\eta$ sufficiently small. If $\Phi$ is the characteristic function of x, the
characteristic function $\Phi(t) e^{-i\bar\gamma t}$ is by hypothesis asymptotically $e^{-\bar\sigma^2 t^2/2}$. On
the other hand, the proof of Theorem 4.1 carried through here makes
$\Phi(t) e^{-i\bar\gamma t}$ asymptotically a characteristic function given by (4.6). (We can
assume that $\bar\gamma$ and $\bar\sigma^2$ are asymptotically near finite limiting values.) Then
the function G in (4.6) must be constant except for a jump at the origin.
According to the derivation of (4.6), this means that 


2P{lyfo)— vl =n} 


goes to 0 with 7. Since \y;| < nC. + b), it follows that if 7 is so small 
that 7(1 + b) < 7/2 then 


E PULO S13 E Ply — yl =a (M0). 
I J 


The direct half of the theorem now asserts that $\sum_j (y_j - \gamma_j)$ is
asymptotically Gaussian with mean 0 and variance $\sigma^2$, so that $\gamma$, $\sigma^2$ must be
asymptotically $\bar\gamma$, $\bar\sigma^2$, respectively, as was to be proved.

Two useful special cases of this theorem will be proved. To illustrate
the method of proving these theorems the first special case (Theorem 4.3)
will be deduced independently and the second one (Theorem 4.4) will be
derived from Theorem 4.2.
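THEOREM 4.3 Let $y_1, y_2, \ldots$ be independent random variables with a
common distribution function, having finite variance $\sigma^2$. Then

    $\frac{1}{\sqrt{n}} \sum_{1}^{n} [y_j - E\{y_j\}]$

is asymptotically Gaussian with expectation 0 and variance $\sigma^2$,

    $\lim_{n\to\infty} P\left\{ \frac{1}{\sqrt{n}} \sum_{j=1}^{n} [y_j(\omega) - E\{y_j\}] \le \lambda \right\} = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\lambda} e^{-\mu^2/2\sigma^2}\, d\mu$

uniformly in $\lambda$.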

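In fact, if $\Phi$ is the characteristic function of $y_j - E\{y_j\}$, we have, using
I (11.4),

    $\Phi(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2)$    $(t \to 0)$

so that

    $\log \Phi(t) = -\frac{\sigma^2 t^2}{2} + o(t^2)$

if t is sufficiently small. The characteristic function of

    $\frac{1}{\sqrt{n}} \sum_{1}^{n} [y_j - E\{y_j\}]$

is $\Phi(t/\sqrt{n})^n$, and

    $\log \Phi\left( \frac{t}{\sqrt{n}} \right)^n = -\frac{\sigma^2 t^2}{2} + n\, o\left( \frac{t^2}{n} \right).$

Since this converges to $-\sigma^2 t^2/2$ uniformly in every finite interval the
theorem is true.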
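Theorem 4.3 can be illustrated by simulation. The sketch below assumes, purely for illustration, that the $y_j$ are uniformly distributed on (0, 1), so that $E\{y_j\} = 1/2$ and $\sigma^2 = 1/12$; the number of summands, the number of trials, and the seed are arbitrary choices of ours. It compares the empirical distribution function of the normalized sum with the Gaussian limit at one point $\lambda$.

```python
import math
import random


def normal_cdf(x, sigma):
    """Distribution function of a Gaussian variable with mean 0 and variance sigma^2."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def empirical_cdf_normalized_sum(n, lam, trials, rng):
    """Estimate P{ n^{-1/2} * sum_{j<=n} (y_j - 1/2) <= lam } for y_j uniform on (0, 1)."""
    count = 0
    for _ in range(trials):
        s = sum(rng.random() - 0.5 for _ in range(n))
        if s / math.sqrt(n) <= lam:
            count += 1
    return count / trials


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed, for reproducibility only
    sigma = math.sqrt(1.0 / 12.0)
    lam = 0.2
    print(empirical_cdf_normalized_sum(30, lam, 20_000, rng), normal_cdf(lam, sigma))
```

The two printed numbers should agree to within sampling error, roughly the second decimal place for this number of trials.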




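We conclude with a famous version of the central limit theorem due to
Liapounov, which has a wide range of applicability.

THEOREM 4.4 Let $y_1, \ldots, y_n$ be mutually independent random variables,
and suppose that $E\{y_j\} = 0$, $E\{|y_j|^{2+\delta}\} = c_j < \infty$, for some $\delta > 0$.
Define $B_n$, $C_n$ by

    $B_n = E\left\{ \left( \sum_{1}^{n} y_j \right)^2 \right\} = \sum_{1}^{n} E\{y_j^2\}$,    $C_n = \sum_{1}^{n} c_j$.

Then, if $B_n > 0$, and if

    $\frac{C_n}{B_n^{1+\delta/2}} \le \varepsilon,$

it follows that

    $P\left\{ \sum_{1}^{n} y_j(\omega) \le \lambda \sqrt{B_n} \right\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\lambda} e^{-\mu^2/2}\, d\mu + o(1)$

where o(1) refers to $\varepsilon \to 0$ and is uniform in $\lambda$ and in the $y_j$ distributions
involved.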

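Note that

    $\sigma_j^2 = E\{y_j^2\} \le c_j^{2/(2+\delta)} \le \varepsilon^{2/(2+\delta)} B_n$,    $j = 1, \ldots, n$,

so that (summing over j)

    $B_n \le n \varepsilon^{2/(2+\delta)} B_n$,    $n \ge \varepsilon^{-2/(2+\delta)}$.

This means that for $\varepsilon$ small the number of $y_j$'s is large.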

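Theorem 4.4 is easily proved directly just as Theorem 4.3 was, using
I (11.4). However, it is instructive to derive it from the general result
of Theorem 4.2. We apply Theorem 4.2 to the variables $y_1 B_n^{-1/2}, \ldots,
y_n B_n^{-1/2}$. In the first place, if $\eta > 0$, the obvious generalization of
Chebyshev's inequality to the exponent $2 + \delta$ yields

    $\sum_j P\{|y_j(\omega)| B_n^{-1/2} \ge \eta\} \le \frac{C_n}{\eta^{2+\delta} B_n^{1+\delta/2}} \le \frac{\varepsilon}{\eta^{2+\delta}} \le \eta$

if $\varepsilon \le \eta^{3+\delta}$. Thus (4.11) is satisfied in the present case if $\varepsilon$ is sufficiently
small. It remains to calculate the truncated expectations and variances
used in Theorem 4.2. We have

    $\gamma_j = \int_{|\lambda| \le \alpha} \lambda\, dP\{y_j(\omega) B_n^{-1/2} \le \lambda\} = -B_n^{-1/2} \int_{|\lambda| > \alpha B_n^{1/2}} \lambda\, dP\{y_j(\omega) \le \lambda\},$

    $\sum_j |\gamma_j| \le \sum_j B_n^{-1/2} (\alpha B_n^{1/2})^{-(1+\delta)} \int_{|\lambda| > \alpha B_n^{1/2}} |\lambda|^{2+\delta}\, dP\{y_j(\omega) \le \lambda\},$

so that

    $\sum_j |\gamma_j| \le \frac{C_n}{\alpha^{1+\delta} B_n^{1+\delta/2}} \le \frac{\varepsilon}{\alpha^{1+\delta}}.$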




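Similarly

    $\sigma_j^2 = \int_{|\lambda| \le \alpha} \lambda^2\, dP\{y_j(\omega) B_n^{-1/2} \le \lambda\} - \gamma_j^2 = B_n^{-1} \int_{|\lambda| \le \alpha B_n^{1/2}} \lambda^2\, dP\{y_j(\omega) \le \lambda\} - \gamma_j^2$
    $\qquad = B_n^{-1} E\{y_j^2\} - B_n^{-1} \int_{|\lambda| > \alpha B_n^{1/2}} \lambda^2\, dP\{y_j(\omega) \le \lambda\} - \gamma_j^2.$

Then (using the fact that $|\gamma_j| \le \alpha$)

    $1 - \sum_j \sigma_j^2 = B_n^{-1} \sum_j \int_{|\lambda| > \alpha B_n^{1/2}} \lambda^2\, dP\{y_j(\omega) \le \lambda\} + \sum_j \gamma_j^2$
    $\qquad \le B_n^{-1} \sum_j (\alpha B_n^{1/2})^{-\delta} \int_{|\lambda| > \alpha B_n^{1/2}} |\lambda|^{2+\delta}\, dP\{y_j(\omega) \le \lambda\} + \alpha \sum_j |\gamma_j|$
    $\qquad \le \frac{2 C_n}{\alpha^{\delta} B_n^{1+\delta/2}} \le \frac{2\varepsilon}{\alpha^{\delta}}.$

Thus this theorem follows from Theorem 4.2.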

As an example of the applicability of this theorem note that, if $y_1, y_2, \ldots$
are mutually independent random variables with $E\{y_j\} = 0$, $E\{|y_j|^{2+\delta}\} \le K$
for some constant K, then, if the variance of $y_j$ is bounded away from 0,
$E\{y_j^2\} \ge \sigma^2 > 0$,

    $\frac{\sum_{1}^{n} y_j}{\left[ \sum_{1}^{n} E\{y_j^2\} \right]^{1/2}}$

is asymptotically normal, $n \to \infty$, with mean 0 and variance 1. In fact,
in this case we have $c_j \le K$ and the critical ratio goes to zero,

    $\frac{C_n}{B_n^{1+\delta/2}} \le \frac{nK}{n^{1+\delta/2} \sigma^{2+\delta}} \to 0.$
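The critical ratio $C_n / B_n^{1+\delta/2}$ is easy to compute once the moments are known. The sketch below takes, as an assumed example, summands equal to $\pm 1$ with probability 1/2 each, so that $E\{y_j^2\} = 1$ and $E\{|y_j|^{2+\delta}\} = 1$ for every $\delta$; with $\delta = 1$ the ratio is then exactly $n^{-1/2}$. The function name is ours.

```python
def liapounov_ratio(second_moments, abs_moments, delta):
    """C_n / B_n^(1 + delta/2), with B_n = sum E{y_j^2} and C_n = sum E{|y_j|^(2+delta)}."""
    b_n = sum(second_moments)
    c_n = sum(abs_moments)
    return c_n / b_n ** (1.0 + delta / 2.0)


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        # Symmetric +-1 summands (assumed example): all moments equal 1.
        print(n, liapounov_ratio([1.0] * n, [1.0] * n, delta=1.0))
```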


5. Stationary case 


In this section we shall suppose that the random variables $x_1, x_2, \ldots$
are mutually independent and have a common distribution function.
The first theorem we prove is the form of the strong law of large numbers
applicable to this case:

THEOREM 5.1 If $x_1, x_2, \ldots$ are mutually independent, and have a
common distribution function, with $E\{|x_1|\} < \infty$, then

(5.1)    $\lim_{n\to\infty} \frac{x_1 + \cdots + x_n}{n} = E\{x_1\}$

with probability 1.




Before proving the theorem we note that, if $E\{x_1^2\} < \infty$, this theorem
is a special case of Theorem 3.4, with $\sigma_1^2 = \sigma_2^2 = \cdots$. To prove the
theorem, define $x_n'$ by

    $x_n'(\omega) = x_n(\omega)$,    $|x_n(\omega)| \le n$
    $x_n'(\omega) = 0$,    $|x_n(\omega)| > n$.

We prove first that $x_j'(\omega) = x_j(\omega)$ for all large j, with probability 1. In
fact (Borel-Cantelli lemma), this follows from

    $\sum_j P\{x_j'(\omega) \ne x_j(\omega)\} = \sum_j P\{|x_1(\omega)| > j\} = \sum_j j\, P\{j < |x_1(\omega)| \le j + 1\} \le E\{|x_1|\}.$

Hence

    $\lim_{n\to\infty} \frac{1}{n} \sum_{1}^{n} (x_j - x_j') = 0$

with probability 1, and we shall therefore discuss first the $x_j'$ averages.
Now 


S Ble BI S LS | Baro <2 
2 


2 
= n s 


z j G- j 22 dP{\a,(w)| <4} 
=| 


du 
=l 


v PAR 
oA drie <4} +4 


< 2E{|2,|} + 4. 
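It now follows from Theorem 3.4 that

    $\lim_{n\to\infty} \frac{1}{n} \sum_{1}^{n} [x_j' - E\{x_j'\}] = 0,$

with probability 1, so that

    $\lim_{n\to\infty} \frac{1}{n} \sum_{1}^{n} [x_j - E\{x_j'\}] = 0$

with probability 1. Since $\lim_{j\to\infty} E\{x_j'\} = E\{x_1\}$,

    $\lim_{n\to\infty} \frac{1}{n} \sum_{1}^{n} E\{x_j'\} = E\{x_1\}.$

Adding this to the previous equation gives the desired result.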




A sequence of mutually independent random variables with a common 
distribution function corresponds to the physical picture of repeated 
trials of an experiment (giving numerical answers). The theory of 
probability was originally invented to deal with this situation, and
discussions of the foundations of probability usually restrict themselves to
it. Now two facts are very striking to anyone who actually performs
repeated trials, and any mathematical analysis must contain theorems 
which correspond to these facts. 

(A) As n increases, the sample ratios $\{[x_1(\omega) + \cdots + x_n(\omega)]/n\}$ are
observed to vary only slightly. The mathematical version of this fact is
Theorem 5.1.

The existence of the limit in (5.1) cannot be verified in practice, of 
course, because infinitely many trials are involved. In fact, the experi- 
menter’s observations cannot a priori be exact enough for him to demand 
of the theoretician that in the latter’s theoretical structure the limit in 
(5.1) exist with (mathematical) probability 1. It is gratifying that the 
limit exists in this strong sense, but it would not be disturbing if it only 
existed in some weaker sense, say as a limit in probability. 

(B) The experimenter also notices that the sample ratios cluster around
the same value as before if the results of some trials are not counted at
all. For example, if the experimenter goes out to lunch, leaving his
apparatus going, but with no recordings being taken, the noon hour $x_j$'s
will be irretrievably lost, but, if they are simply ignored, and the $x_j$'s
recorded only when the experimenter is present and interested, the cluster
value is unaffected. More generally, the cluster value is unaffected if the
experimenter, instead of merely ignoring certain trials because of the
pangs of hunger or boredom or love or because of other irrelevant reasons
independent of the trials, actually ignores some because of the results of
past trials, say out of disgust with past results. In other words, the
criterion of acceptance or rejection may depend on past results. This can
be expressed formally as follows: Let $x_1(\omega), x_2(\omega), \ldots$ be the full sequence
of (original) sample values, and let $x_1'(\omega), x_2'(\omega), \ldots$ be the $x_j(\omega)$'s that
are actually recorded. It is supposed, then, that there are integers
$n_1 < n_2 < \cdots$ and $x_j' = x_{n_j}$; the $n_j$'s may be random variables. Now
the experimenter may choose the trials to record, that is, the $n_j$'s, on the
basis of his lack of appetite, his hunch that the experiment is going well,
the results of previous trials, or general cussedness. However, we shall
not allow him to be clairvoyant. This is interpreted by prescribing that
after he has recorded $x_{j-1}'(\omega) = x_{n_{j-1}(\omega)}(\omega)$, his choice of the next $x_j(\omega)$ to
record is to be based only on past events; that is, the condition $n_j(\omega) = \nu$
is a condition on $x_1, \ldots, x_{\nu-1}$ alone. In mathematical terms $x_1, x_2, \ldots$
are random variables, $n_1, n_2, \ldots$ are integral-valued random variables,
and the $\omega$ set $\{n_j(\omega) = \nu\}$ differs by at most an $\omega$ set of probability 0 from
an $\omega$ set of the form $\{[x_1(\omega), \ldots, x_{\nu-1}(\omega)] \in A\}$, where A is a $(\nu-1)$-dimensional
Borel set. The question is thus reduced to the relation between the old
random variables $x_1, x_2, \ldots$ and the new ones $x_1', x_2', \ldots$, where
$x_j' = x_{n_j}$. In the following we shall assume that in this way $N \le \infty$ new
random variables $x_1', x_2', \ldots$ have been defined, and we shall simplify
the argument unessentially by assuming that these random variables are
certainly defined, that is, each $x_j'$ is defined with probability 1 or not at all.

In the time-honored language of gambling, the $x_j$'s can be considered
the numbers turned up in some sort of game in which a gambling system
is applied to choose which plays to bet on and which to reject. For
example, if $x_j$ can take on two values 1 or 0 corresponding to red or black
in roulette, $x_1'$ might be the first number after the first 1, $x_2'$ the first after
a run of two 1's, and in general $x_j'$ the first after a run of $j$ 1's. This
gambling system, in which the usual gambler would bet that the $x_j'$'s have
more chance of being 0 than the $x_j$'s, has an important advantage; there
is a longer and longer wait between bets, that is, between plays that are
accepted, and there is thus more and more time available to the gambler
to think and reform before he loses his money, and to study probability
rather than gambling.† The disadvantage, or rather lack of advantage,
of this system, is that like all other systems it leaves the gambler’s chances 
entirely unaffected. This is the substance of the following theorem. 
Note again that this theorem specifically excludes the hypothesis of clair- 
voyance. Naturally any prophet can make money at a gambling casino. 
The fact that those recently discovered do not do so shows that their high 
principles nullify their supernatural advantages over mathematically 
limited (by Theorem 5.2) mortals. 

THEOREM 5.2 The random variables $x_1', x_2', \cdots$ have the same probability
properties as $x_1, x_2, \cdots$; that is, $x_1', x_2', \cdots$ are mutually independent,
with a common distribution function, that of $x_1$. Moreover (defining
$x_j = 0$ for $j \le 0$) for each $j$, the sets of random variables $\{\cdots, x_{n_j-2}, x_{n_j-1}\}$,
$\{x_{n_j} = x_j', x_{j+1}', \cdots\}$ are mutually independent.

Note that, according to Theorem 5.1, applied to $x_1', x_2', \cdots$ (assuming
the truth of Theorem 5.2),

$$\lim_{n \to \infty} \frac{x_1' + \cdots + x_n'}{n} = E\{x_1'\} = E\{x_1\}$$

with probability 1. It is usually this property, that limiting ratios are not
altered by the use of a system, that is used in the foundation studies,
rather than the more general theorem as stated, and this is of course the
mathematical version of (B) above.

† This formulation ignores those who would find the first less interesting than the
second; in fact it considers only those who are trying to make gambling profitable
rather than merely fascinating.

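To prove Theorem 5.2 we note that for any intervals $I_1, \cdots, I_m$

$$\begin{aligned}
P\{x_1'(\omega) \in I_1, \cdots, x_m'(\omega) \in I_m\}
&= \sum_{\alpha_1 < \cdots < \alpha_m} P\{n_j(\omega) = \alpha_j,\ x_{\alpha_j}(\omega) \in I_j,\ j = 1, \cdots, m-1;\ n_m(\omega) = \alpha_m,\ x_{\alpha_m}(\omega) \in I_m\} \\
&= \sum_{\alpha_1 < \cdots < \alpha_m} P\{n_j(\omega) = \alpha_j,\ x_{\alpha_j}(\omega) \in I_j,\ j = 1, \cdots, m-1;\ n_m(\omega) = \alpha_m\}\, P\{x_{\alpha_m}(\omega) \in I_m\}
\end{aligned}$$

since the conditions imposed on $n_1, \cdots, n_m, x_{\alpha_1}, \cdots, x_{\alpha_{m-1}}$ involve only
$x_j$'s for $j < \alpha_m$, so that the factor

$$P\{x_{\alpha_m}(\omega) \in I_m\} = P\{x_1(\omega) \in I_m\}$$

can be separated out. Summing first over $\alpha_m$, we now obtain, since by
hypothesis $n_m$ is certainly finite,

$$\sum_{\alpha_1 < \cdots < \alpha_{m-1}} P\{n_j(\omega) = \alpha_j,\ x_{\alpha_j}(\omega) \in I_j,\ j = 1, \cdots, m-1\}\, P\{x_1(\omega) \in I_m\}.$$

Repeating this procedure $m - 1$ times, we finally evaluate the multiple
sum, finding it to be

$$\prod_{j=1}^{m} P\{x_1(\omega) \in I_j\}$$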
j=1 


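so that the $x_j'$'s have the same distribution as the $x_j$'s. If now we evaluate

$$P\{x_{n_j+k-1}(\omega) \in I_k,\ k = -m, \cdots, 0;\ x_{j+k-1}'(\omega) \in I_k,\ k = 1, \cdots, m\},$$

the same reasoning reduces this to

$$P\{x_{n_j+k-1}(\omega) \in I_k,\ k = -m, \cdots, 0\}\, P\{x_1(\omega) \in I_1\} \cdots P\{x_1(\omega) \in I_m\},$$

which proves the last statement of the theorem.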
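
The statement of Theorem 5.2 can be checked empirically on the roulette example above. The following Python sketch is an illustration only and is not part of the original argument. It assumes one particular reading of the system (the $j$th accepted play is the first play after a run of $j$ consecutive 1's that began after the previous accepted play); the names `run_system` and `estimate`, the value $p = 0.4$, and the number of trials are all hypothetical choices.

```python
import random

def run_system(rng, p, num_bets):
    """Return the first num_bets accepted plays x'_1, ..., x'_num_bets.

    Assumed reading of the system: the j-th accepted play is the first play
    following a run of j consecutive 1's that started after the previous
    accepted play.
    """
    accepted = []
    run = 0  # length of the current run of 1's since the last accepted play
    while len(accepted) < num_bets:
        needed = len(accepted) + 1
        # The decision to accept the coming play uses past plays only
        # (no clairvoyance): it is made before the play is drawn.
        accept_next = run >= needed
        x = 1 if rng.random() < p else 0
        if accept_next:
            accepted.append(x)
            run = 0
        else:
            run = run + 1 if x == 1 else 0
    return accepted

def estimate(p=0.4, num_bets=3, trials=20000, seed=1):
    """Relative frequency of 1 for each x'_j, and of the event x'_1 = x'_2 = 1.

    Requires num_bets >= 2.
    """
    rng = random.Random(seed)
    ones = [0] * num_bets
    both = 0
    for _ in range(trials):
        acc = run_system(rng, p, num_bets)
        for j, x in enumerate(acc):
            ones[j] += x
        both += acc[0] * acc[1]
    return [c / trials for c in ones], both / trials

if __name__ == "__main__":
    freqs, joint = estimate()
    print("frequency of 1 among x'_1, x'_2, x'_3:", freqs)
    print("frequency of x'_1 = x'_2 = 1:", joint)
```

If the theorem holds, each frequency should be close to $p$ and the joint frequency close to $p^2$, whatever the gambler believes about runs.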

Theorem 5.2 is commonly used implicitly in probability discussions.
For example, let $x_1, x_2, \cdots$ be mutually independent random variables,
with

$$P\{x_j(\omega) = 1\} = p, \qquad P\{x_j(\omega) = 0\} = 1 - p.$$

Let $t_0$ be the number of 0's of the $x_j$'s before the first 1, so that

$$P\{t_0(\omega) = m\} = (1 - p)^m p.$$




Now consider the successive times $t_1, t_2, \cdots$ between groups of 1's; more
precisely define $a_1, a_2, \cdots, t_1, t_2, \cdots$ by:

$$\begin{aligned}
a_1(\omega) &= \operatorname{Min} n \text{ such that } x_n(\omega) = 1,\ x_{n+1}(\omega) = 0 \\
a_1(\omega) + t_1(\omega) &= \operatorname{Min} n > a_1(\omega) \text{ such that } x_n(\omega) = 0,\ x_{n+1}(\omega) = 1 \\
a_2(\omega) &= \operatorname{Min} n > a_1(\omega) + t_1(\omega) \text{ such that } x_n(\omega) = 1,\ x_{n+1}(\omega) = 0 \\
&\ \ \vdots
\end{aligned}$$

It is commonly accepted without proof that the random variables
$t_0 + 1, t_1, t_2, \cdots$ are mutually independent and have a common distribution.
The fact that this is true as well as intuitively obvious is deduced from
Theorem 5.2 as follows. In accordance with this theorem, $x_{a_j+2}, x_{a_j+3}, \cdots$
are mutually independent with the same distribution as $x_1$. The number
of 0's before the first 1 for this sequence is $t_j - 1$. Hence $t_j$ has the same
distribution as $t_0 + 1$. Moreover according to this theorem $x_{a_j+2},
x_{a_j+3}, \cdots$ form a sequence independent of $\cdots, x_{a_j}, x_{a_j+1}$. Hence $t_j$ is
independent of the latter variables, and so of $t_0, \cdots, t_{j-1}$.
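
The definitions of $a_j$ and $t_j$ translate directly into a scan over a 0/1 sequence. The sketch below is an illustration under stated assumptions, not part of the original text: it uses 0-based list indices (which does not affect the differences $t_j$), stops when the finite sample runs out, and the names `gaps_between_groups` and `simulate` are hypothetical.

```python
import random

def gaps_between_groups(xs):
    """Return (t0, [t1, t2, ...]) for a list xs of 0's and 1's.

    t0 is the number of 0's before the first 1; each t_j is the number of
    0's separating two successive groups of 1's, found as in the text:
    a_j is the next index with x = 1 followed by 0, and a_j + t_j is the
    next index after a_j with x = 0 followed by 1.
    """
    n = len(xs)
    t0 = xs.index(1)  # raises ValueError if the sample contains no 1
    gaps = []
    i = 0
    while True:
        # locate a_j: x_i = 1 and x_{i+1} = 0
        while i + 1 < n and not (xs[i] == 1 and xs[i + 1] == 0):
            i += 1
        if i + 1 >= n:
            break
        a = i
        # locate a_j + t_j: first k > a_j with x_k = 0 and x_{k+1} = 1
        k = a + 1
        while k + 1 < n and not (xs[k] == 0 and xs[k + 1] == 1):
            k += 1
        if k + 1 >= n:
            break
        gaps.append(k - a)
        i = k + 1
    return t0, gaps

def simulate(p=0.5, length=200000, seed=2):
    """Draw an independent 0/1 sample with P{x = 1} = p and return its gaps."""
    rng = random.Random(seed)
    xs = [1 if rng.random() < p else 0 for _ in range(length)]
    return gaps_between_groups(xs)

if __name__ == "__main__":
    t0, gaps = simulate()
    print("t0 =", t0, " mean gap =", sum(gaps) / len(gaps))
```

Since $t_0 + 1$ takes the value $m$ with probability $(1-p)^{m-1} p$, the observed gaps should have mean near $1/p$ and the value 1 should occur with relative frequency near $p$.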


CHAPTER IV 


Processes with Mutually Uncorrelated 
or Orthogonal Random Variables 


1. General remarks 


In this chapter the random variables will be complex-valued, and two
random variables equal almost everywhere will be considered identical.
Let $x$ and $y$ be random variables, with

$$E\{|x|^2\} < \infty, \qquad E\{|y|^2\} < \infty.$$

Then if

$$E\{x\bar{y}\} = E\{x\}\,E\{\bar{y}\}$$

the random variables are said to be uncorrelated; if

$$E\{x\bar{y}\} = 0$$

they are said to be orthogonal. If $x$ and $y$ are uncorrelated, $x - E\{x\}$
and $y - E\{y\}$ are orthogonal. In most discussions of uncorrelated
random variables the variables are centered by subtraction of expectations,
obtaining orthogonal random variables, and it is for this reason that
throughout the present chapter we shall discuss only processes with
mutually orthogonal variables. We shall not, however, suppose that
their expectations vanish, since this hypothesis is usually irrelevant
although frequently satisfied. Only the discrete parameter case is of
interest here, just as in the case of processes with mutually independent
random variables and for the same reason; in the continuous parameter
case the sample functions are too discontinuous for the processes to be
useful.

It is always difficult to distinguish between probability theory and 
measure theory; it is pointless even on historical grounds to do so in 
studying orthogonality. Random variables are measurable functions and 
“expectation” is a semantical obfuscation of “integral,” but here the 
deception ceases; two functions $f$, $g$ are called orthogonal if the integral
of $f\bar{g}$ vanishes. It is, however, somewhat embarrassing, and needlessly
restrictive, to suppose that the total measure is 1, or even that it is finite.
In fact, throughout this chapter we shall discuss only formal properties of
orthogonality which do not involve the measure of the space itself. 
However, if the measure of the space is infinite, we suppose that the space 
is the union of enumerably many measurable sets of finite measure, to 
avoid integration difficulties. It is to be understood, then, that in the 
present chapter “random variable” may be interpreted as “measurable 
function,” defined on a measure space with no restriction other than the 
one just stated. It is merely for the sake of consistency with the point of 
view of the rest of the book that the theorems are stated in probability 
rather than in measure terminology. In several subsequent chapters the 
theorems on orthogonal series will be applied to cases in which the total
measure need not be 1.

There are many good and readily available treatments of orthogonal 
functions, and the treatment in this chapter is therefore somewhat sketchy.
The sketchiness of the treatment should not, however, lead the reader to
believe that this chapter is not a part of probability. The law of large
numbers is a probability theorem whether the random variables involved
are orthogonal or whether they are mutually independent, and it is only 
historical tradition that has relegated the corresponding theorems to 
different books. It is true, however, that the probabilist is less interested 
in particular sequences of orthogonal functions, like Legendre polynomials, 
trigonometric functions, etc., than the analyst. The former is usually 
more interested in the properties shared by all such sequences than in 
those possessed by particular ones. 


2. Geometrical considerations

Throughout this chapter all the random variables will have finite
second moments and the “distance” between two random variables $x$ and
$y$ will be the root mean square distance

$$[E\{|x - y|^2\}]^{1/2}.$$

A collection of random variables will be called a linear manifold if,
whenever any finite number of them, say $x_1, \cdots, x_n$, are in the collection,
any linear combination $\sum_1^n a_j x_j$ is also in the collection. (In the following,
“linear combination” always means “finite linear combination.”) The
manifold is said to be closed if, whenever $x_1, x_2, \cdots$ are random variables
in the manifold and $\operatorname{l.i.m.}_{n \to \infty} x_n = x$, $x$ is also in the manifold. This is in
conformity with the usual concept of closure, using the root mean square
distance.
All possible linear combinations of the random variables of a given
collection make up a new (in general larger) collection. The new collection
is a linear manifold, the smallest linear manifold containing the given
collection; it is called the linear manifold generated by the given collection.
If in addition all possible limits in the mean of sequences of random 
variables in this linear manifold are adjoined to the manifold, a new (in 
general larger) linear manifold is obtained. This new linear manifold is 
closed, and is the smallest closed linear manifold containing the given 
collection; it is called the closed linear manifold generated by the given 
collection. 

Let $\mathfrak{M}$ be any linear manifold and let $x$ be orthogonal to every random
variable in it. The variable $x$ is then said to be orthogonal to $\mathfrak{M}$. The
variable $x$ will then automatically be orthogonal to the closed linear
manifold generated by $\mathfrak{M}$. In fact, if $y = \operatorname{l.i.m.}_{n \to \infty} y_n$ and if $y_n$ is in $\mathfrak{M}$,

$$|E\{x\bar{y}\}|^2 = |E\{x(\bar{y} - \bar{y}_n)\}|^2 \le E\{|x|^2\}\,E\{|y - y_n|^2\}$$

and the second factor goes to 0 when $n \to \infty$.

If $x$ is orthogonal to all the random variables of a certain collection, it
is obviously orthogonal to the linear manifold (and hence to the closed
linear manifold) they generate.

Two linear manifolds are said to be orthogonal if every random variable
in one is orthogonal to every random variable in the other.


3. General definition of projection 


Let $y_1, y_2, \cdots$ be a finite or infinite sequence of random variables. If

$$E\{y_j\bar{y}_k\} = \delta_{jk} \qquad (\delta_{jk} = 1 \text{ if } j = k,\ \delta_{jk} = 0 \text{ if } j \ne k),$$

the variables are said to form an orthonormal sequence. If $x = \sum_j a_j y_j$
and if term by term integration can be justified, $E\{x\bar{y}_j\} = a_j$. It is therefore
natural to consider for any random variable $x$ the series $\sum_j a_j y_j$ with
$a_j$'s defined in this way. We write

$$x \sim \sum_j a_j y_j, \qquad a_j = E\{x\bar{y}_j\}. \tag{3.1}$$

The coefficients $a_1, a_2, \cdots$ are called the Fourier coefficients of $x$, and the
sign $\sim$ means merely that $x$ corresponds to the indicated series, called
the Fourier series of $x$ with respect to the given orthonormal sequence.
We have already observed that, if any series $\sum_j b_j y_j$ converges to $x$, then,
if the series is well behaved (for example, if it is only a finite series), the
$b_j$'s must necessarily be the Fourier coefficients. The Fourier series of $x$
may converge, however, without converging to $x$. For example, if
probability is defined by a uniform distribution (that is, constant density)
on the interval $0 \le \xi \le 2\pi$,

$$\sqrt{2}\cos\xi,\ \sqrt{2}\sin\xi,\ \sqrt{2}\cos 2\xi,\ \sqrt{2}\sin 2\xi,\ \cdots$$

is an orthonormal sequence. The function $x \equiv 1$ is a random variable
all of whose Fourier coefficients vanish,

$$1 \sim \sum 0 = 0.$$


Let $w_1, w_2, \cdots$ be a finite or infinite sequence of random variables
(not all vanishing almost everywhere). It is possible to orthogonalize this
sequence, that is, to find an orthonormal sequence $y_1, y_2, \cdots$ of random
variables with the property that each $y_j$ is a linear combination of $w_k$'s
and conversely. Then the linear manifold generated by the $y_j$'s is the
same as that generated by the $w_j$'s. The orthogonalization will be
sketched briefly. Dropping any $w_j$ which is a linear combination of
preceding $w_k$'s we can suppose that the $w_j$'s are linearly independent.
Define $\hat{y}_j$ by $\hat{y}_1 = w_1$ and, for $j > 1$,

$$\hat{y}_j = \begin{vmatrix}
E\{w_1\bar{w}_1\} & \cdots & E\{w_1\bar{w}_{j-1}\} & w_1 \\
\vdots & & \vdots & \vdots \\
E\{w_j\bar{w}_1\} & \cdots & E\{w_j\bar{w}_{j-1}\} & w_j
\end{vmatrix}.$$

It is easy to verify that $\hat{y}_j$ is a linear combination of $w_1, \cdots, w_j$, orthogonal
to $w_1, \cdots, w_{j-1}$ and therefore to $\hat{y}_1, \cdots, \hat{y}_{j-1}$, and that, because $w_1, \cdots, w_j$
are linearly independent, $E\{|\hat{y}_j|^2\} \ne 0$; we set $y_j = \hat{y}_j / [E\{|\hat{y}_j|^2\}]^{1/2}$.
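
The orthogonalization can be tried out on a finite sample space, where a random variable is a list of values and expectation is a weighted sum. The following sketch is an illustration, not the construction of the text: in place of the determinant formula it subtracts projections successively (the modified Gram-Schmidt procedure), which yields the same $y_j$'s up to constant factors of modulus 1. The names `inner` and `orthonormalize` and the tolerance `tol` are assumptions made for the example.

```python
import math

def inner(x, y, prob):
    """E{x * conj(y)} on a finite sample space with point probabilities prob."""
    return sum(p * a * b.conjugate() for p, a, b in zip(prob, x, y))

def orthonormalize(ws, prob, tol=1e-12):
    """Orthonormalize the random variables ws (lists of values).

    A w that is numerically a linear combination of its predecessors is
    dropped, as in the text. Successive subtraction of projections is used
    here instead of the determinant formula.
    """
    ys = []
    for w in ws:
        r = list(w)
        for y in ys:
            c = inner(r, y, prob)  # Fourier coefficient of the residual on y
            r = [ri - c * yi for ri, yi in zip(r, y)]
        norm2 = inner(r, r, prob).real
        if norm2 > tol:
            s = math.sqrt(norm2)
            ys.append([ri / s for ri in r])
    return ys

if __name__ == "__main__":
    prob = [0.1, 0.2, 0.3, 0.4]
    ws = [[1, 1, 1, 1], [1, 2, 3, 4], [2, 3, 4, 5], [1j, 0, 1, 0]]
    ys = orthonormalize(ws, prob)
    print(len(ys), "orthonormal variables from", len(ws), "given ones")
```

In the usage example the third variable is the sum of the first two, so it is dropped and three orthonormal variables remain.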
Let $w_1, \cdots, w_n$ be $n$ random variables, not all vanishing almost everywhere,
and let $\mathfrak{M}$ be the linear manifold generated by the $w_j$'s, consisting
of all linear combinations $\sum_1^n c_j w_j$. Orthogonalizing the $w_j$'s, $m \le n$
random variables $y_1, \cdots, y_m$ are found in $\mathfrak{M}$ forming an orthonormal
sequence which generates $\mathfrak{M}$. If the $w_j$'s are linearly independent,
$m = n$. Any random variable $x$ in $\mathfrak{M}$ can be written in the form

$$x = \sum_{j=1}^{m} a_j y_j \tag{3.2}$$

and we have seen that the $a_j$'s must be the Fourier coefficients of $x$. If
$x'$ is a second random variable in this manifold,

$$x' = \sum_{j=1}^{m} a_j' y_j,$$

we have

$$E\{|x - x'|^2\} = \sum_{j=1}^{m} |a_j - a_j'|^2. \tag{3.3}$$




Thus each random variable $x$ corresponds to a vector $(a_1, \cdots, a_m)$ with
$m$ complex components, and distance between random variables is the
usual Euclidean distance between the endpoints of the corresponding
vectors. Linear combinations of random variables go over into the same
linear combinations of the corresponding vectors, so that a linear manifold
of random variables corresponds to a linear manifold of vectors,
geometrically, to a plane through the origin. Closed linear manifolds
correspond to closed linear manifolds. In particular, the manifold $\mathfrak{M}$
itself corresponds to the whole vector space, and is therefore a closed
linear manifold. Moreover

$$E\{x\bar{x}'\} = \sum_j a_j\bar{a}_j', \tag{3.4}$$

so that orthogonal random variables correspond to orthogonal vectors.
The random variables $y_1, \cdots, y_m$ correspond to the coordinate vectors

$$(1, 0, \cdots, 0),\ \cdots,\ (0, \cdots, 0, 1).$$

If $x$ is any random variable in $\mathfrak{M}$, $a_j = E\{x\bar{y}_j\}$ is, in terms of the vector
picture, the component of the vector corresponding to $x$ on the $j$th
coordinate vector, and $a_j y_j$ goes into the projection of the former vector
on the latter. More generally, if $\mathfrak{M}_1$ is the linear manifold of random
variables generated by $y_1, \cdots, y_r$, the Fourier series for $x$ in terms of
these $y_j$'s goes into the projection of the vector corresponding to $x$ on
the plane corresponding to $\mathfrak{M}_1$. This geometric picture clarifies the
following remarks, in which we assume $\mathfrak{M}$ and the $y_j$'s as in this paragraph,
but also consider random variables not in $\mathfrak{M}$.
If $x$ is any random variable, and if

$$x_1 = \sum_{j=1}^{m} a_j y_j, \qquad a_j = E\{x\bar{y}_j\},$$

then $x = x_1$ with probability 1 if $x$ is in $\mathfrak{M}$, and in any case the following
relations hold:

(a) $E\{|x_1|^2\} = \sum_j |a_j|^2$.

This is verified by direct calculation.

(b) $E\{|x|^2\} = \sum_j |a_j|^2 + E\{|x - x_1|^2\}$.

This is verified by evaluating $E\{|x - x_1|^2\}$, and, in the $m$-dimensional
picture, is simply the Pythagorean theorem. In particular,

$$E\{|x|^2\} \ge \sum_j |a_j|^2.$$

This inequality is known as Bessel's inequality.




(c) The random variables $x$ and $x_1$ have the same Fourier series.
Hence $x - x_1$ is orthogonal to every $y_j$ and consequently to every random
variable in $\mathfrak{M}$. Conversely, if $x_2$ is in $\mathfrak{M}$ and if $x - x_2$ is orthogonal to
every $y_j$, $x_2 = x_1$ with probability 1. In fact, then $x$ and $x_2$ have the
same Fourier coefficients so that $x_2$ (in $\mathfrak{M}$) is the sum of the same Fourier
series as $x_1$.

(d) The random variable $x_1$ is in $\mathfrak{M}$ and is closer to $x$ than any other
random variable in $\mathfrak{M}$ (counting random variables as the same if they are
equal with probability 1). In fact, if $x_2$ is also in $\mathfrak{M}$, $x_1 - x_2$ is in $\mathfrak{M}$ and
therefore orthogonal to $x - x_1$. Hence

$$\begin{aligned}
E\{|x - x_2|^2\} &= E\{|(x - x_1) + (x_1 - x_2)|^2\} \\
&= E\{|x - x_1|^2\} + E\{|x_1 - x_2|^2\} \ge E\{|x - x_1|^2\}
\end{aligned}$$

and there is equality if and only if $x_1 = x_2$ with probability 1.
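
Relations (a) to (d) can be verified numerically on a small example. The sketch below is an illustration under stated assumptions: the sample space has 8 equally likely points, the orthonormal variables are taken to be $y_k(\omega) = e^{2\pi i k\omega/8}$, $k = 1, 2, 3$, and the helper names (`expectation`, `inner`, `norm2`, `project`) are hypothetical.

```python
import cmath

N = 8
PROB = [1.0 / N] * N  # uniform distribution on the points 0, ..., N-1

def expectation(z):
    return sum(p * v for p, v in zip(PROB, z))

def inner(x, y):
    """E{x * conj(y)}."""
    return expectation([a * b.conjugate() for a, b in zip(x, y)])

def norm2(x):
    """E{|x|^2}."""
    return inner(x, x).real

# An assumed orthonormal sequence: y_k(w) = exp(2*pi*i*k*w/N), k = 1, 2, 3.
YS = [[cmath.exp(2j * cmath.pi * k * w / N) for w in range(N)] for k in (1, 2, 3)]

def project(x, ys):
    """Return the Fourier coefficients a_j = E{x conj(y_j)} and x_1 = sum a_j y_j."""
    coeffs = [inner(x, y) for y in ys]
    x1 = [sum(a * y[w] for a, y in zip(coeffs, ys)) for w in range(len(x))]
    return coeffs, x1

if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
    coeffs, x1 = project(x, YS)
    residual = [a - b for a, b in zip(x, x1)]
    print("E|x|^2            =", norm2(x))
    print("sum |a_j|^2       =", sum(abs(a) ** 2 for a in coeffs))
    print("E|x - x_1|^2      =", norm2(residual))
```

The first printed number should equal the sum of the other two, which is relation (b); Bessel's inequality follows because the last number is non-negative.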

In II we have called $x_1$ the wide sense conditional expectation of $x$
relative to $y_1, \cdots, y_m$ (or $w_1, \cdots, w_n$),

$$x_1 = \hat{E}\{x \mid y_1, \cdots, y_m\} = \hat{E}\{x \mid w_1, \cdots, w_n\}.$$

We shall sometimes write

$$x_1 = \hat{E}\{x \mid \mathfrak{M}\},$$

or we may even put for $\mathfrak{M}$ any collection of random variables which
generates the closed linear manifold $\mathfrak{M}$.

Two trivial cases have been excluded above. If $\mathfrak{M}$ contains no random
variables, or only random variables which vanish almost everywhere, we
define $\hat{E}\{x \mid \mathfrak{M}\} = 0$.


Let $w_1, w_2, \cdots$ be an infinite sequence of random variables, generating
the closed linear manifold $\mathfrak{M}$. By orthogonalizing the $w_j$'s an orthonormal
sequence $y_1, y_2, \cdots$ can be found, in the linear manifold generated
by the $w_j$'s, which also generates $\mathfrak{M}$. We shall suppose that there are
infinitely many linearly independent $w_j$'s so that there are infinitely many
$y_j$'s; otherwise the present case would be the same as the preceding one.
We wish to show that all the considerations of the preceding case go
over to the present one. To show this we remark that, if $x \sim \sum_j a_j y_j$,
Bessel's inequality

$$E\{|x|^2\} \ge \sum_j |a_j|^2$$

still holds, since it has been seen to hold for any finite number of the $y_j$'s.
We shall now show that every random variable $x$ in $\mathfrak{M}$ can be written in
the form

$$x = \sum_j a_j y_j, \qquad a_j = E\{x\bar{y}_j\}, \tag{3.2'}$$

where $\sum_j |a_j|^2 < \infty$ and the sum converges in the mean, that is,

$$x = \operatorname*{l.i.m.}_{n \to \infty} \sum_{1}^{n} a_j y_j,$$

and we shall show that conversely, if $\sum_j |a_j|^2 < \infty$, $\sum_j a_j y_j$ converges in
the mean to a random variable in $\mathfrak{M}$, whose Fourier coefficients are the
$a_j$'s. We show first that, if $\sum_j |a_j|^2 < \infty$, the series $\sum_j a_j y_j$ converges in
the mean, by verifying that the Cauchy condition for convergence in the
mean is satisfied. In fact,

$$E\Big\{\Big|\sum_{1}^{n} a_j y_j - \sum_{1}^{m} a_j y_j\Big|^2\Big\} = E\Big\{\Big|\sum_{m+1}^{n} a_j y_j\Big|^2\Big\} = \sum_{m+1}^{n} |a_j|^2$$

and the last sum goes to 0 when $m, n \to \infty$. The series $\sum_j a_j y_j$ thus has
a sum, which is certainly in $\mathfrak{M}$. Conversely, if $x$ is in $\mathfrak{M}$ and if $x \sim \sum_j a_j y_j$
then $\sum_j |a_j|^2 < \infty$ by Bessel's inequality, so that the Fourier series
converges (in the mean) to some random variable $x_1$ in $\mathfrak{M}$. Moreover $x_1$
has the same Fourier coefficients as $x$, since, if $n > k$,

$$\begin{aligned}
E\{x_1\bar{y}_k\} &= E\Big\{\Big(\sum_{1}^{n} a_j y_j\Big)\bar{y}_k\Big\} + E\Big\{\Big(\sum_{n+1}^{\infty} a_j y_j\Big)\bar{y}_k\Big\} \\
&= a_k + E\Big\{\Big(\sum_{n+1}^{\infty} a_j y_j\Big)\bar{y}_k\Big\};
\end{aligned}$$

here the last term is 0 since $y_k$ is orthogonal to $y_{n+1}, y_{n+2}, \cdots$ and therefore
to the sum $\sum_{n+1}^{\infty} a_j y_j$ which is in the closed linear manifold generated
by these $y_j$'s. We have now proved that $x - x_1$ has Fourier series 0, so
that $x - x_1$ is orthogonal to the $y_j$'s, and therefore to $\mathfrak{M}$. In particular,
$x - x_1$ is orthogonal to itself,

$$E\{|x - x_1|^2\} = 0,$$

that is, $x = x_1$ almost everywhere. This proves that if $x$ is in $\mathfrak{M}$ it is the
sum of its Fourier series (convergence in the mean).

Thus (3.2) goes over to the case of infinite series, and statements (a),
(b), (c), (d) are true also for infinite series; since they really depended
on (3.2) their proofs need no change.

We have already defined the symbol $\hat{E}\{x \mid \mathfrak{M}\}$ if $\mathfrak{M}$ is generated by a
finite number of random variables. In the same way, if $\mathfrak{M}$ cannot be so
generated but is generated by a denumerably infinite sequence of random
variables, we define $\hat{E}\{x \mid \mathfrak{M}\}$ as $x_1$, the random variable in $\mathfrak{M}$ closest to $x$.
(Note that this geometrical property implies that $x_1$ does not depend on
the particular orthonormal sequence used to define it.) The random
variable $\hat{E}\{x \mid \mathfrak{M}\}$ is sometimes called the projection of $x$ on $\mathfrak{M}$. We have
seen that from the point of view of this book it is the wide sense conditional
expectation of $x$ relative to the variables of $\mathfrak{M}$. The above treatment is
easily extended to include the possibility that $\mathfrak{M}$ may not be generated
by any denumerable sequence of random variables. It will sometimes
be convenient to replace $\mathfrak{M}$ in the notation $\hat{E}\{x \mid \mathfrak{M}\}$ by any collection of
random variables which generate $\mathfrak{M}$. This agrees with our earlier
definition of $\hat{E}\{x \mid w_1, \cdots, w_n\}$ as a certain linear combination of
$w_1, \cdots, w_n$.

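Finally we show that $\hat{E}\{- \mid -\}$ and $E\{- \mid -\}$ satisfy analogous functional
equations. (In the following it is to be understood that the equations
are to be true with probability 1, not identically.) In the first place

$$\begin{aligned}
\hat{E}\{cx \mid \mathfrak{M}\} &= c\,\hat{E}\{x \mid \mathfrak{M}\} \\
\hat{E}\{x_1 + x_2 \mid \mathfrak{M}\} &= \hat{E}\{x_1 \mid \mathfrak{M}\} + \hat{E}\{x_2 \mid \mathfrak{M}\}.
\end{aligned}$$

These equations follow, for example, from the linearity property of
Fourier coefficients. In the second place, to prove the analogue of I,
(10.8) we prove that if $\mathfrak{M}_1$ and $\mathfrak{M}_2$ are any collections of random variables
(with finite second moments), with $\mathfrak{M}_1 \subset \mathfrak{M}_2$,

$$\hat{E}\{\hat{E}\{x \mid \mathfrak{M}_2\} \mid \mathfrak{M}_1\} = \hat{E}\{x \mid \mathfrak{M}_1\}.$$

To prove this we prove that

$$\hat{E}\{x - \hat{E}\{x \mid \mathfrak{M}_2\} \mid \mathfrak{M}_1\} = 0.$$

Now $x - \hat{E}\{x \mid \mathfrak{M}_2\}$ is a random variable orthogonal to the closed linear
manifold generated by $\mathfrak{M}_2$. Hence its projection on $\mathfrak{M}_1$ is 0, as was to
be proved.

Finally we remark that, if $\mathfrak{M}_1$ is orthogonal to $\mathfrak{M}_2$,

$$\hat{E}\{x \mid \mathfrak{M}_1, \mathfrak{M}_2\} = \hat{E}\{x \mid \mathfrak{M}_1\} + \hat{E}\{x \mid \mathfrak{M}_2\}.$$

This fact implies that $\hat{E}\{x \mid \mathfrak{M}\}$ is unchanged if random variables orthogonal
to $x$ are adjoined to $\mathfrak{M}$ or if any collection $\mathfrak{M}_0$ of random variables
in $\mathfrak{M}$ is replaced by the single random variable $\hat{E}\{x \mid \mathfrak{M}_0\}$.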


4. Series of orthogonal random variables

THEOREM 4.1 Let $y_1, y_2, \cdots$ be mutually orthogonal random variables,
with $E\{|y_j|^2\} = \sigma_j^2$. Then the series $\sum_1^\infty y_j$ converges in the mean if and
only if $\sum_1^\infty \sigma_j^2 < \infty$.

The theorem is a consequence of the evaluation

$$E\Big\{\Big|\sum_{1}^{n} y_j - \sum_{1}^{m} y_j\Big|^2\Big\} = E\Big\{\Big|\sum_{m+1}^{n} y_j\Big|^2\Big\} = \sum_{m+1}^{n} \sigma_j^2.$$

Alternatively the theorem can be reduced to theorems on Fourier series
by the remark that $y_1/\sigma_1, y_2/\sigma_2, \cdots$ is an orthonormal sequence (deleting
the terms for which $\sigma_j = 0$). This theorem is the weak sense version of
III, Theorem 2.3.

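For many purposes it is desirable to have conditions under which
$\sum_j y_j$ converges with probability 1. Theorem 4.2 gives one set of such
conditions. Its proof depends on a preliminary lemma.

LEMMA 4.1 If $y_1, \cdots, y_n$ are mutually orthogonal, with

$$E\{|y_j|^2\} = \sigma_j^2,$$

then

$$E\Big\{\operatorname*{Max}_{j \le n} |y_1 + \cdots + y_j|^2\Big\} \le \Big(\frac{\log 4n}{\log 2}\Big)^2 \sum_{1}^{n} \sigma_j^2. \tag{4.1}$$

Note that if the left side were replaced by $E\{|y_1 + \cdots + y_n|^2\}$ it
would be equal to $(\sigma_1^2 + \cdots + \sigma_n^2)$. The Max is made possible by the
factor $(\log 4n/\log 2)^2$. If $n = 1$, (4.1) is true with the factor decreased
to 1. If $n > 1$, define the integer $r \ge 0$ by

$$2^r < n \le 2^{r+1} = N$$

and define $y_j = 0$ for $n < j \le N$. Let $s$ be the sum of all partial sums
of the form

$$|y_{\alpha+1} + \cdots + y_\beta|^2, \qquad \alpha = \mu 2^\nu,\ \beta = (\mu + 1)2^\nu,$$
$$\nu = 0, \cdots, r + 1, \qquad \mu = 0, \cdots, 2^{r+1-\nu} - 1.$$

Then for each $\nu$ the sums in $s$ for that value of $\nu$ have total expectation
$(\sigma_1^2 + \cdots + \sigma_n^2)$, so that

$$E\{s\} = (r + 2)(\sigma_1^2 + \cdots + \sigma_n^2).$$

Now we can separate $y_1 + \cdots + y_j$ into partial sums of the type
$y_{\alpha+1} + \cdots + y_\beta$ with $\alpha$, $\beta$ as above, say

$$y_1 + \cdots + y_j = \eta_1 + \cdots + \eta_k,$$

where $\eta_i$ contains $2^{r_i}$ terms, $r + 1 \ge r_1 > \cdots > r_k \ge 0$. Then by
Schwarz's inequality

$$|y_1 + \cdots + y_j|^2 \le k \sum_{i=1}^{k} |\eta_i|^2 \le (r + 2)s.$$

Hence

$$\begin{aligned}
E\Big\{\operatorname*{Max}_{j \le n} |y_1 + \cdots + y_j|^2\Big\} &\le (r + 2)E\{s\} \le (r + 2)^2(\sigma_1^2 + \cdots + \sigma_n^2) \\
&\le \Big(\frac{\log n}{\log 2} + 2\Big)^2 (\sigma_1^2 + \cdots + \sigma_n^2),
\end{aligned}$$

as was to be proved.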
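
The separation of $y_1 + \cdots + y_j$ into dyadic blocks used in this proof follows the binary expansion of $j$. The sketch below is an illustration only; the function name `dyadic_blocks` is hypothetical, and a block is reported as the pair $(\alpha, \beta)$, standing for the sum $y_{\alpha+1} + \cdots + y_\beta$.

```python
def dyadic_blocks(j):
    """Split the index range 1..j into dyadic blocks.

    Each block is a pair (alpha, beta) with alpha = mu * 2**nu and
    beta = (mu + 1) * 2**nu, and the block lengths 2**nu are strictly
    decreasing, so at most one block of each length is used.
    """
    blocks = []
    alpha = 0
    nu = j.bit_length() - 1  # largest nu with 2**nu <= j (for j >= 1)
    while alpha < j:
        size = 1 << nu
        if alpha + size <= j:
            blocks.append((alpha, alpha + size))
            alpha += size
        nu -= 1
    return blocks

if __name__ == "__main__":
    for j in (1, 6, 13):
        print(j, dyadic_blocks(j))
```

Because at most one block of each length $2^\nu$, $\nu = 0, \cdots, r + 1$, can occur, the number $k$ of blocks never exceeds $r + 2$, which is the bound used with Schwarz's inequality above.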
THEOREM 4.2 Let $y_1, y_2, \cdots$ be mutually orthogonal random variables,
with $E\{|y_n|^2\} = \sigma_n^2$. Then, if $\sum_1^\infty \sigma_n^2 \log^2 n < \infty$, the series $\sum_1^\infty y_n$ converges
in the mean and with probability 1.
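Convergence in the mean is assured by Theorem 4.1. Let the sum be
$x$ and let $x_n = \sum_{j=1}^{n} y_j$. Then

$$E\{|x - x_n|^2\} = \sum_{j=n+1}^{\infty} \sigma_j^2 \le \frac{1}{\log^2 n} \sum_{j=n+1}^{\infty} \sigma_j^2 \log^2 j.$$

It follows that, if $n = 2^r$,

$$\sum_{r=1}^{\infty} P\{|x(\omega) - x_n(\omega)| \ge (\log n)^{-1/4}\} \le \sum_{r=1}^{\infty} \frac{\sum_j \sigma_j^2 \log^2 j}{(\log 2^r)^{3/2}} < \infty,$$

so that (Borel-Cantelli lemma)

$$|x(\omega) - x_n(\omega)| < (\log n)^{-1/4}, \qquad n = 2^r, \tag{4.2}$$

for sufficiently large $n$, with probability 1. Moreover, according to
Lemma 4.1,

$$\begin{aligned}
E\Big\{\operatorname*{Max}_{n < m < 2n} |x_m - x_n|^2\Big\} &\le \Big[\frac{\log 4(n - 1)}{\log 2}\Big]^2 \sum_{j=n+1}^{2n-1} \sigma_j^2 \\
&\le \Big[\frac{\log 4(n - 1)}{\log 2}\Big]^2 \frac{1}{\log^2 (n + 1)} \sum_{j=n+1}^{2n-1} \sigma_j^2 \log^2 j \\
&\le \frac{3}{\log^2 2} \sum_{j=n+1}^{2n-1} \sigma_j^2 \log^2 j = \varepsilon_n
\end{aligned} \tag{4.3}$$

for $n > 1$. Here, if $n = 2^r$, $\sum_{r=1}^{\infty} \varepsilon_n < \infty$. Choose $\delta_2, \delta_4, \cdots$ so that

$$\delta_n > 0, \qquad \lim_{n \to \infty} \delta_n = 0, \qquad \sum_{r=1}^{\infty} \frac{\varepsilon_n}{\delta_n} < \infty \quad (n = 2^r).$$

Then (4.3) yields

$$\sum_{r=1}^{\infty} P\Big\{\operatorname*{Max}_{n < m < 2n} |x_m(\omega) - x_n(\omega)| \ge \delta_n^{1/2}\Big\} \le \sum_{r=1}^{\infty} \frac{\varepsilon_n}{\delta_n} < \infty.$$

Hence (Borel-Cantelli lemma again)

$$|x_m(\omega) - x_n(\omega)| < \delta_n^{1/2}, \qquad n = 2^r < m < 2n, \tag{4.4}$$

for sufficiently large $n$, with probability 1. The combination of (4.2)
with (4.4) gives convergence of $\sum_1^\infty y_n$ with probability 1.

Note that if the $y_n$'s are mutually independent and have zero expectations the
condition of Theorem 4.2 can be weakened to $\sum_1^\infty \sigma_n^2 < \infty$ (III, Theorem 2.3).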

5. The law of large numbers

We shall see in this section how little the quantitative hypotheses of
some of the theorems on the law of large numbers for mutually independent
random variables (III §3) need be strengthened if only mutual
orthogonality is presupposed.


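THEOREM 5.1 If $y_1, y_2, \cdots$ are mutually orthogonal random variables,
with $E\{|y_j|^2\} = \sigma_j^2$,

$$\operatorname*{l.i.m.}_{n \to \infty} \frac{y_1 + \cdots + y_n}{n} = 0 \tag{5.1}$$

if and only if

$$\lim_{n \to \infty} \frac{\sigma_1^2 + \cdots + \sigma_n^2}{n^2} = 0. \tag{5.2}$$

The theorem is obvious from the evaluation

$$E\Big\{\Big|\frac{y_1 + \cdots + y_n}{n}\Big|^2\Big\} = \frac{\sigma_1^2 + \cdots + \sigma_n^2}{n^2};$$

it is the weak sense version of III, Theorem 3.2. The part of the latter
theorem connecting III (3.8) and (3.9) is still true, but is of less interest
when the $y_j$'s are supposed merely mutually orthogonal.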
According to III, Theorem 3.4, if the $y_j$'s have zero expectations and
are mutually independent, and if $\sum_1^\infty \sigma_n^2/n^2 < \infty$, the strong law of large
numbers holds. Under the present hypothesis of mutual orthogonality
somewhat more severe restrictions must be imposed on the $\sigma_n^2$'s.

THEOREM 5.2 If $y_1, y_2, \cdots$ are mutually orthogonal random variables,
with $E\{|y_j|^2\} = \sigma_j^2$, and if

$$\sum_{1}^{\infty} \frac{\sigma_n^2 \log^2 n}{n^2} < \infty,$$

then

$$\lim_{n \to \infty} \frac{y_1 + \cdots + y_n}{n} = 0 \tag{5.3}$$

with probability 1.

As in the proof of Theorem 3.4 of III, it is sufficient to prove that
$\sum_1^\infty y_n/n$ converges in the mean and with probability 1, and this convergence
is assured by Theorem 4.2.

This theorem is less well known than its counterpart for independent
random variables. We therefore point out a special case: if $y_1, y_2, \cdots$
are random variables, with

$$E\{y_n\} = 0, \qquad E\{|y_n|^2\} \le K, \qquad n \ge 1,$$

the strong law of large numbers (5.3) holds not only if the variables are
mutually independent (III, Theorem 3.4) but even if they are only mutually
uncorrelated (that is, mutually orthogonal since the expectations vanish).
Note, however, that the condition that $E\{y_n\} = 0$ does not appear in
Theorem 5.2 and is quite irrelevant. Orthogonality is the essential
condition and $E\{y_n\} = 0$ is only important in that, together with zero
correlation between pairs of $y_n$'s, it implies orthogonality of the pairs.
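
A concrete instance of the special case may help. The sketch below is an illustration chosen for this purpose and does not come from the text at this point: with probability defined as length on $0 \le \omega < 1$, the variables $y_n(\omega) = \sqrt{2}\cos 2\pi n\omega$ are orthonormal with zero expectations, yet they are far from independent, since every one of them is a function of the same $\omega$. The function names and the grid size are assumptions.

```python
import math

def y(n, omega):
    """y_n(omega) = sqrt(2) * cos(2*pi*n*omega), omega in [0, 1)."""
    return math.sqrt(2.0) * math.cos(2.0 * math.pi * n * omega)

def sample_average(omega, n):
    """(y_1 + ... + y_n) / n evaluated at the sample point omega."""
    return sum(y(k, omega) for k in range(1, n + 1)) / n

def inner_product(j, k, grid=1000):
    """E{y_j y_k} under length on [0, 1), by an equally spaced sum.

    For these trigonometric polynomials the equally spaced sum is exact
    (up to rounding) as long as j + k is smaller than grid.
    """
    return sum(y(j, m / grid) * y(k, m / grid) for m in range(grid)) / grid

if __name__ == "__main__":
    print("E{y_1 y_2} =", inner_product(1, 2))
    print("E{y_3^2}   =", inner_product(3, 3))
    for n in (10, 100, 1000, 10000):
        print(n, sample_average(0.3, n))
```

Here $\sigma_n^2 = 1$, so the hypothesis of Theorem 5.2 is satisfied and the averages tend to 0 for almost all $\omega$; the run at $\omega = 0.3$ shows one such sample point.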


6. Series $\sum_0^\infty a_j e^{2\pi i j\lambda}$ of power series type

In the present section we suppose that probabilities are defined as
length (Lebesgue measure) on the interval $-\frac{1}{2} \le \lambda \le \frac{1}{2}$, so that

$$1,\ e^{2\pi i\lambda},\ e^{-2\pi i\lambda},\ \cdots$$

is an orthonormal sequence. In particular we are interested in characterization
of those functions of $\lambda$ whose Fourier series contain no negative
powers of $e^{2\pi i\lambda}$. The results are sketched, and will be used only in XII,
in the theory of least squares prediction.

Note that, since the orthogonal sequence under discussion consists of
bounded functions, any integrable function has a Fourier series, whether
the second moment exists or not.

The following theorem is stated for ready reference. Its proof will be
omitted.

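THEOREM 6.1 If $f(z) = \sum_0^\infty \gamma_n z^n$ is analytic for $|z| < 1$ then
$\int_{-1/2}^{1/2} |f(re^{2\pi i\lambda})|\,d\lambda$
is monotone non-decreasing in $r$. Suppose that this integral has a finite
limit when $r \to 1$. Then

$$\lim_{r \to 1} f(re^{2\pi i\lambda}) = f(e^{2\pi i\lambda}) \tag{6.1}$$

exists for almost all values of $\lambda$. Moreover $f(e^{2\pi i\lambda})$ is integrable,

$$\lim_{r \to 1} \int_{-1/2}^{1/2} |f(e^{2\pi i\lambda}) - f(re^{2\pi i\lambda})|\,d\lambda = 0 \tag{6.2}$$

and

$$f(e^{2\pi i\lambda}) \sim \sum_{0}^{\infty} \gamma_n e^{2\pi i n\lambda}. \tag{6.3}$$

$f(z)$ is given by the Cauchy integral formula

$$f(z) = \frac{1}{2\pi i} \int_{|\zeta| = 1} \frac{f(\zeta)}{\zeta - z}\,d\zeta. \tag{6.4}$$

Conversely, if $f(e^{2\pi i\lambda})$ is any integrable function of $\lambda$, whose Fourier series
has the form (6.3), the function $f(z) = \sum_0^\infty \gamma_n z^n$ defined for $|z| < 1$ has all
the above stated properties. Finally, in this case, $f(e^{2\pi i\lambda}) = 0$ only on a
set of Lebesgue measure 0, or on a set of Lebesgue measure 1 (when
$\gamma_0 = \gamma_1 = \cdots = 0$).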

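If $g(e^{2\pi i\lambda})$ is a real integrable function of $\lambda$, with Fourier series

$$g(e^{2\pi i\lambda}) \sim \sum_{-\infty}^{\infty} \gamma_n e^{2\pi i n\lambda}, \tag{6.3'}$$

then $g(re^{2\pi i\lambda})$, given by

$$\begin{aligned}
g(re^{2\pi i\lambda}) &= \int_{-1/2}^{1/2} g(e^{2\pi i\mu})\,\frac{1 - r^2}{1 - 2r\cos 2\pi(\lambda - \mu) + r^2}\,d\mu \\
&= \sum_{-\infty}^{\infty} \gamma_n r^{|n|} e^{2\pi i n\lambda},
\end{aligned}$$

defines a harmonic function of $re^{2\pi i\lambda}$ for $r < 1$, and

$$\lim_{r \to 1} g(re^{2\pi i\lambda}) = g(e^{2\pi i\lambda}) \tag{6.1'}$$

for almost all values of $\lambda$.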
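
The equality of the Poisson integral and the series $\sum \gamma_n r^{|n|} e^{2\pi i n\lambda}$ is easy to check numerically for a trigonometric polynomial. The sketch below is an illustration under stated assumptions: the boundary function $3 + \cos 2\pi\mu - 2\sin 4\pi\mu$, the midpoint rule with `points` nodes, and the function names are all choices made for the example.

```python
import math

def poisson_integral(g, r, lam, points=2000):
    """Approximate the Poisson integral of the boundary function g.

    g is a real function of mu on [-1/2, 1/2]; the integral over mu of
    g(mu) * (1 - r^2) / (1 - 2 r cos 2 pi (lam - mu) + r^2) is computed
    by the midpoint rule.
    """
    total = 0.0
    for k in range(points):
        mu = -0.5 + (k + 0.5) / points
        kernel = (1.0 - r * r) / (
            1.0 - 2.0 * r * math.cos(2.0 * math.pi * (lam - mu)) + r * r
        )
        total += g(mu) * kernel
    return total / points

def boundary(mu):
    """An assumed boundary function: 3 + cos(2 pi mu) - 2 sin(4 pi mu)."""
    return 3.0 + math.cos(2.0 * math.pi * mu) - 2.0 * math.sin(4.0 * math.pi * mu)

def series_value(r, lam):
    """The same harmonic function from its series: each term of frequency n
    is multiplied by r**|n|."""
    return (3.0 + r * math.cos(2.0 * math.pi * lam)
            - 2.0 * r * r * math.sin(4.0 * math.pi * lam))

if __name__ == "__main__":
    r, lam = 0.6, 0.1
    print(poisson_integral(boundary, r, lam), series_value(r, lam))
```

The two printed numbers should agree to many decimal places, since the midpoint rule is very accurate for smooth periodic integrands.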
The next theorem provides the factoring of a spectral density, as 
required for prediction theory. 
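THEOREM 6.2 If $f(\lambda)$ $(\ge 0)$ is Lebesgue integrable over the interval
$-\frac{1}{2} \le \lambda \le \frac{1}{2}$, if $f = 0$ at most on a set of measure 0, and if

$$\int_{-1/2}^{1/2} \log f(\lambda)\,d\lambda > -\infty, \tag{6.5}$$

there is a uniquely determined sequence of numbers $\gamma_0, \gamma_1, \cdots$ with $\gamma_0$
real and positive, $\sum_0^\infty |\gamma_n|^2 < \infty$,

$$\sum_{0}^{\infty} \gamma_n z^n \ne 0, \qquad |z| < 1, \tag{6.6}$$

$$\log \gamma_0 = \frac{1}{2} \int_{-1/2}^{1/2} \log f(\lambda)\,d\lambda, \tag{6.7}$$

and

$$f(\lambda) = \Big|\sum_{0}^{\infty} \gamma_n e^{2\pi i n\lambda}\Big|^2. \tag{6.8}$$

(The series converges in the mean.) The $\gamma_n$'s are determined by the relations

$$\log \sqrt{f(\lambda)} \sim \sum_{-\infty}^{\infty} a_n e^{2\pi i n\lambda}, \qquad
\sum_{0}^{\infty} \gamma_n z^n = \exp\Big(a_0 + 2\sum_{1}^{\infty} a_n z^n\Big), \quad |z| < 1. \tag{6.9}$$

Conversely suppose that $\gamma_0, \gamma_1, \cdots$ are any numbers with $0 < \sum_0^\infty |\gamma_n|^2
< \infty$, and that (6.6) is true. Then, if $f(\lambda)$ is defined by (6.8), $f = 0$ at most
on a set of measure 0, (6.5) is true, and in fact

$$\log |\gamma_0| \le \frac{1}{2} \int_{-1/2}^{1/2} \log f(\lambda)\,d\lambda. \tag{6.7'}$$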

=1/2 


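Note that, for $f \ge 1$, $\log f < f$. Hence if $f$ is integrable $\log f$ is
integrable unless $f$ is too small, that is, the integral of $\log f$ is finite or
$-\infty$. If $|f_1|^2 = f$, $f_1$ has a Fourier series $\sum_{-\infty}^{\infty} \gamma_n e^{2\pi i n\lambda}$, with $\sum |\gamma_n|^2 < \infty$.
Thus

$$f(\lambda) = \Big|\sum_{-\infty}^{\infty} \gamma_n e^{2\pi i n\lambda}\Big|^2.$$

The point of the theorem is that $f_1$ can be chosen so that $\gamma_n = 0$ for
$n < 0$ if (6.5) is true.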
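The proof of the theorem relies on Theorem 6.1. If (6.5) is true,
$\log \sqrt{f}$ has a Fourier series,

$$\log \sqrt{f(\lambda)} \sim \sum_{-\infty}^{\infty} a_n e^{2\pi i n\lambda}, \qquad a_{-n} = \bar{a}_n.$$

Let $g$ be the function

$$g(z) = a_0 + 2\sum_{1}^{\infty} a_n z^n,$$

analytic for $|z| < 1$, whose real part

$$\Re g(re^{2\pi i\lambda}) = \sum_{-\infty}^{\infty} a_n r^{|n|} e^{2\pi i n\lambda}$$

is the harmonic function with boundary function $\log \sqrt{f(\lambda)}$ (see Theorem
6.1). Finally define $\gamma(z) = e^{g(z)}$. Then $\gamma(z) \ne 0$ for $|z| < 1$ and

$$\lim_{r \to 1} |\gamma(re^{2\pi i\lambda})| = \lim_{r \to 1} e^{\Re g(re^{2\pi i\lambda})} = e^{\log \sqrt{f(\lambda)}} = \sqrt{f(\lambda)} \tag{6.10}$$

for almost all values of $\lambda$. The function $\gamma(z)$ has a power series development

$$\gamma(z) = \sum_{0}^{\infty} \gamma_n z^n = \exp\Big(a_0 + 2\sum_{1}^{\infty} a_n z^n\Big), \qquad |z| < 1. \tag{6.11}$$

To show that this can be extended to $|z| = 1$, in the sense that $\gamma(e^{2\pi i\lambda})$ is
defined as the radial limit of $\gamma(z)$ and has the Fourier series $\sum_0^\infty \gamma_n e^{2\pi i n\lambda}$, we
show that Theorem 6.1 is applicable:

$$\begin{aligned}
\int_{-1/2}^{1/2} |\gamma(re^{2\pi i\lambda})|\,d\lambda
&= \int_{-1/2}^{1/2} e^{\Re g(re^{2\pi i\lambda})}\,d\lambda \\
&= \int_{-1/2}^{1/2} \exp\Big[\int_{-1/2}^{1/2} \log \sqrt{f(\mu)}\,\frac{1 - r^2}{1 - 2r\cos 2\pi(\lambda - \mu) + r^2}\,d\mu\Big]\,d\lambda \\
&\le \int_{-1/2}^{1/2} d\lambda \int_{-1/2}^{1/2} \sqrt{f(\mu)}\,\frac{1 - r^2}{1 - 2r\cos 2\pi(\lambda - \mu) + r^2}\,d\mu \\
&= \int_{-1/2}^{1/2} \sqrt{f(\mu)}\,d\mu \int_{-1/2}^{1/2} \frac{1 - r^2}{1 - 2r\cos 2\pi(\lambda - \mu) + r^2}\,d\lambda \\
&= \int_{-1/2}^{1/2} \sqrt{f(\mu)}\,d\mu.
\end{aligned} \tag{6.12}$$

Thus the integral on the left remains bounded when $r \to 1$, which means
according to Theorem 6.1 that $\gamma(z)$ has a boundary function $\lim_{r \to 1} \gamma(re^{2\pi i\lambda})
= \gamma(e^{2\pi i\lambda})$ for almost all values of $\lambda$, and that $\gamma(e^{2\pi i\lambda})$ has the Fourier series
$\sum_0^\infty \gamma_n e^{2\pi i n\lambda}$. Since, as we have already seen, $|\gamma(e^{2\pi i\lambda})| = \sqrt{f(\lambda)}$, it follows
that $|\gamma(e^{2\pi i\lambda})|^2$ is integrable, so that $\sum_0^\infty |\gamma_n|^2 < \infty$, and thus we have
obtained the desired representation [in which (6.7) follows from (6.9)
with $z = 0$]

$$f(\lambda) = |\gamma(e^{2\pi i\lambda})|^2 = \Big|\sum_{0}^{\infty} \gamma_n e^{2\pi i n\lambda}\Big|^2.$$

The uniqueness of the $\gamma_n$'s will be proved after the proof of the
converse.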
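
The relations (6.9) give a recipe for computing the $\gamma_n$'s, which can be sketched numerically. The code below is an illustration under stated assumptions and not a definitive implementation: the Fourier coefficients $a_n$ of $\log\sqrt{f}$ are approximated by an equally spaced sum, and the power series of $\exp(g(z))$ is obtained from the identity $\gamma'(z) = g'(z)\gamma(z)$, that is, $n\gamma_n = \sum_{k=1}^{n} k g_k \gamma_{n-k}$. The names `spectral_factor` and `example_density`, the grid size, and the test density $f(\lambda) = |1 + b e^{2\pi i\lambda}|^2$ are hypothetical choices.

```python
import cmath
import math

def spectral_factor(f, num_coeffs=8, grid=512):
    """Approximate gamma_0, ..., gamma_{num_coeffs-1} of (6.9) for a density f.

    f must be strictly positive on [-1/2, 1/2] for the logarithm to be
    taken pointwise on the grid.
    """
    lams = [-0.5 + k / grid for k in range(grid)]
    log_sqrt_f = [0.5 * math.log(f(lam)) for lam in lams]
    # a_n = integral of log sqrt(f(lam)) * exp(-2 pi i n lam) d lam
    a = [
        sum(v * cmath.exp(-2j * math.pi * n * lam)
            for v, lam in zip(log_sqrt_f, lams)) / grid
        for n in range(num_coeffs)
    ]
    # g(z) = a_0 + 2 * sum_{n >= 1} a_n z^n
    g = [a[0]] + [2.0 * an for an in a[1:]]
    # gamma(z) = exp(g(z)); coefficients from gamma' = g' * gamma
    gamma = [cmath.exp(g[0])]
    for n in range(1, num_coeffs):
        gamma.append(sum(k * g[k] * gamma[n - k] for k in range(1, n + 1)) / n)
    return gamma

def example_density(b):
    """f(lam) = |1 + b exp(2 pi i lam)|^2 = 1 + 2 b cos(2 pi lam) + b^2."""
    return lambda lam: 1.0 + 2.0 * b * math.cos(2.0 * math.pi * lam) + b * b

if __name__ == "__main__":
    for n, c in enumerate(spectral_factor(example_density(0.5))):
        print(n, c)
```

For $|b| < 1$ the polynomial $1 + bz$ does not vanish in $|z| < 1$, so for this density the computed coefficients should be close to $\gamma_0 = 1$, $\gamma_1 = b$, and $\gamma_n = 0$ for $n \ge 2$.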


Conversely, suppose that $f$ is given by (6.8), where $0 < \sum_0^\infty |\gamma_n|^2 < \infty$,
and suppose that $\gamma(z) = \sum_0^\infty \gamma_n z^n$ does not vanish for $|z| < 1$. Then
$\log |\gamma(z)|$ is harmonic and

$$\log |\gamma(0)| = \log |\gamma_0| = \int_{-1/2}^{1/2} \log |\gamma(re^{2\pi i\lambda})|\,d\lambda, \qquad r < 1. \tag{6.13}$$

Now according to Theorem 6.1 $\gamma(e^{2\pi i\lambda})$ is defined for almost all $\lambda$ in terms
of radial approach to $|z| = 1$, and is the function with Fourier series
$\sum_0^\infty \gamma_n e^{2\pi i n\lambda}$. Let $h_b(x) = \operatorname{Max}[x, b]$. Then

$$\lim_{r \to 1} \int_{-1/2}^{1/2} h_b[\log |\gamma(re^{2\pi i\lambda})|]\,d\lambda = \int_{-1/2}^{1/2} h_b[\log |\gamma(e^{2\pi i\lambda})|]\,d\lambda,$$

because the square of the integrand on the left is less than $|\gamma|$ for large
values of $|\gamma|$, and the integral of $|\gamma|$ is bounded when $r \to 1$. Hence

$$\log |\gamma_0| \le \int_{-1/2}^{1/2} h_b[\log |\gamma(e^{2\pi i\lambda})|]\,d\lambda.$$

When $b \to -\infty$ in this inequality, it yields (6.7'). Note that a trivial
variation of this discussion shows that $\log |\gamma(z)| \le a(z)$, where $a(z)$ is the
value at $z$ of the harmonic function determined by the Poisson integral
with boundary function $\log |\gamma(e^{2\pi i\lambda})|$. The discussion treated the case
$z = 0$. Since the difference $a(z) - \log |\gamma(z)|$ defines a non-negative
harmonic function, this difference either never vanishes or vanishes
identically. Thus, if (6.7) is known to be true, so that this difference
vanishes when $z = 0$, we have now shown that $\log |\gamma(re^{2\pi i\lambda})|$ can be
represented by the Poisson integral with boundary function $\log |\gamma(e^{2\pi i\lambda})|$.
This fact is used in the following uniqueness proof.

To prove the uniqueness statement of the first part of the theorem, we must show that the hypotheses that $f(\lambda)$ is given by (6.8) with $\gamma_n = \gamma_n'$ and also with $\gamma_n = \gamma_n''$, and that (6.6) and (6.7) are satisfied by both representations of $f(\lambda)$ imply that $\gamma_n' = c\gamma_n''$, $n \ge 0$, where $|c| = 1$. Define

$$\gamma_1(z) = \sum_{n=0}^{\infty} \gamma_n' z^n, \qquad \gamma_2(z) = \sum_{n=0}^{\infty} \gamma_n'' z^n.$$

It will be sufficient to show that these functions are proportional. Now $\log\gamma_1(z)$, $\log\gamma_2(z)$ are analytic for $|z| < 1$, and we have seen above that $\log|\gamma_1(z)|$ is determined by its boundary function $\log|\sum_0^\infty \gamma_n' e^{2\pi i n\lambda}|$; that is, $\log|\gamma_1(z)| = \log|\gamma_2(z)|$. Then $\log\gamma_1(z)$ and $\log\gamma_2(z)$ can differ at most by an imaginary constant, $i\theta$, so that

$$\gamma_1(z) = e^{i\theta}\gamma_2(z),$$

as was to be proved.


7. Martingales in the wide sense 


Martingales (wide sense) were defined in II §7 as processes whose random variables $\{x_t\}$ have finite second moments and satisfy the equation

$$\hat{E}\{x_{t_{n+1}} \mid x_{t_1}, \cdots, x_{t_n}\} = x_{t_n} \tag{7.1}$$

(with probability 1) whenever $t_1 < \cdots < t_{n+1}$; here $n$ is an arbitrary positive integer. This is true if and only if $x_{t_{n+1}} - x_{t_n}$ is orthogonal to $x_{t_j}$ for $j \le n$, and, since these $t_j$'s are arbitrary, $x_{t_{n+1}} - x_{t_n}$ must be orthogonal to every $x_t$ for $t \le t_n$. In other words, (7.1) is equivalent to the condition

$$\hat{E}\{x_t \mid x_r,\ r \le s\} = x_s \tag{7.1$'$}$$

with probability 1 whenever $s < t$. In the present section $t$ will range through the integers, or some of the integers, sometimes including $\pm\infty$. If $t$ is integral it is sufficient if (7.1') is replaced by

$$\hat{E}\{x_{n+1} \mid x_j,\ j \le n\} = x_n. \tag{7.1$''$}$$

Note that the defining properties (7.1) and (7.1') are meaningful whenever the range of $t$ is an ordered set, and that any set of random variables of a martingale (ordered as before) also constitutes a martingale.


THEOREM 7.1 If the random variables $\{x_t\}$ constitute a martingale (wide sense) then, if $t_1 < t_2$,

$$E\{|x_{t_1}|^2\} \le E\{|x_{t_2}|^2\} \tag{7.2}$$

and there is equality only if $x_{t_1} = x_{t_2}$ with probability 1.




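In fact, according to the martingale property $x_{t_2} - x_{t_1}$ is orthogonal to $x_{t_1}$, so that

$$E\{|x_{t_2}|^2\} = E\{|(x_{t_2} - x_{t_1}) + x_{t_1}|^2\} = E\{|x_{t_2} - x_{t_1}|^2\} + E\{|x_{t_1}|^2\} \ge E\{|x_{t_1}|^2\}$$

and there is equality only if $x_{t_1} = x_{t_2}$ with probability 1.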
If the random variables $x_1, x_2, \cdots$ constitute a martingale, and if $y_j$ is defined by

$$y_1 = x_1, \qquad y_j = x_j - x_{j-1}, \quad j > 1,$$

then the martingale property implies that the $y_j$'s are mutually orthogonal, and

$$x_n = \sum_{j=1}^{n} y_j.$$

Conversely, if the $y_j$'s are any mutually orthogonal random variables and if $x_n$ is defined in this way, the $x_n$ process is a wide sense martingale.
This would make it appear that the whole subject of martingales in the 
wide sense might well be forgotten since the variables are simply the 
partial sums of a series of orthogonal random variables. However, the 
natural way they arise in Hilbert space considerations and the light they 
shed on problems of least square approximation (XII) and on martingales 
in the strict sense (VII) makes it worth while to study their properties 
briefly. The most interesting case is the continuous parameter case (IX). 
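The correspondence between wide sense martingales and partial sums of orthogonal random variables can be checked on a small finite example. The sketch below is only an illustration under stated assumptions: the sample space of 8 equally likely points, the use of scaled Walsh functions as orthogonal increments, and all function names are choices made here, not part of the text.

```python
from fractions import Fraction
from itertools import accumulate

# Assumed sample space: 8 equally likely points 0..7.
OMEGA = range(8)

def expect(u, v):
    """E{u v} for real random variables given as lists of values over OMEGA."""
    return sum(Fraction(a * b, len(u)) for a, b in zip(u, v))

def walsh(j):
    """Walsh function j: (-1) to the number of common 1-bits of j and w."""
    return [(-1) ** bin(j & w).count("1") for w in OMEGA]

# Mutually orthogonal increments y_1, y_2, y_3 (Walsh functions times 3, 1, 2).
ys = [[c * v for v in walsh(j)] for j, c in ((1, 3), (2, 1), (4, 2))]

# Partial sums x_n = y_1 + ... + y_n.
xs = list(accumulate(ys, lambda x, y: [a + b for a, b in zip(x, y)]))

# E{|x_n|^2} for n = 1, 2, 3; these should be non-decreasing, as in (7.2).
second_moments = [expect(x, x) for x in xs]

def increment_orthogonal(m, n):
    """True if x_n - x_m is orthogonal to x_j for every j <= m (0-based)."""
    diff = [a - b for a, b in zip(xs[n], xs[m])]
    return all(expect(diff, xs[j]) == 0 for j in range(m + 1))
```

Here `second_moments` is the sequence of sums of the $E\{|y_j|^2\}$, and `increment_orthogonal` tests the orthogonality condition that is equivalent to the wide sense martingale property.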

THEOREM 7.2 Let $x_1, x_2, \cdots$ be random variables constituting a martingale (wide sense). Then

$$E\{|x_1|^2\} \le E\{|x_2|^2\} \le \cdots. \tag{7.3}$$

If $\lim_{n\to\infty} E\{|x_n|^2\} = l < \infty$, $\mathrm{l.i.m.}_{n\to\infty}\, x_n = x_\infty$ exists and the random variables $x_1, x_2, \cdots, x_\infty$ constitute a martingale (wide sense).


The strict sense version of this theorem is VII, Theorem 4.1 (i), (ii), (iii). To prove the theorem we need only remark that, if $x_n$ is written in the above form with mutually orthogonal $y_j$'s,

$$E\{|x_n|^2\} = \sum_{j=1}^{n} E\{|y_j|^2\}.$$

Hence (Theorem 4.1), if $l < \infty$, $\mathrm{l.i.m.}_{n\to\infty}\, x_n = x_\infty$ exists. To prove the last statement of the theorem we must prove that $x_\infty - x_n$ is orthogonal to $x_j$ for $j \le n$. The difference $x_m - x_n$ ($m > n$) has this property, and when $m \to \infty$ we obtain the desired result.




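THEOREM 7.3 Let $\cdots, x_{-1}, x_0$ be random variables constituting a martingale (wide sense), let $\mathfrak{M}_n$ be the closed linear manifold generated by $\cdots, x_{n-1}, x_n$ and let $\mathfrak{M} = \bigcap_n \mathfrak{M}_n$. Then

$$\mathrm{l.i.m.}_{n\to-\infty}\, x_n = x_{-\infty} \quad [= \hat{E}\{x_0 \mid \mathfrak{M}\}] \tag{7.4}$$

and the random variables $x_{-\infty}, \cdots, x_{-1}, x_0$ constitute a martingale (wide sense).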
The strict sense version of this theorem is VII, Theorem 4.2. Let $y_n = x_n - x_{n-1}$. Then the $y_j$'s are mutually orthogonal and (Theorem 4.1) the existence of the limit in the mean $x_{-\infty}$ will follow from the convergence of the series $\sum_1^\infty E\{|y_{-n}|^2\}$. This convergence is a consequence of

$$\sum_{j=1}^{n} E\{|y_{-j}|^2\} = E\{|x_{-1} - x_{-n-1}|^2\} \le 2E\{|x_{-1}|^2\} + 2E\{|x_{-n-1}|^2\} \le 4E\{|x_0|^2\}. \tag{7.5}$$

To show that $x_{-\infty}$ is the projection indicated in (7.4) we note that the projection is characterized by two conditions: it is in $\mathfrak{M}$ (as is $x_{-\infty}$, since $x_{-\infty}$ is in every $\mathfrak{M}_n$), and $x_0 - \hat{E}\{x_0 \mid \mathfrak{M}\}$ is orthogonal to $\mathfrak{M}$. To show that $x_0 - x_{-\infty}$ is orthogonal to $\mathfrak{M}$ we need only remark that $x_0 - x_n$ is orthogonal to $\mathfrak{M}_n$, and hence to $\mathfrak{M}$; going to the limit, $n \to -\infty$ gives the desired result. (Of course, $x_{-\infty} = \hat{E}\{x_k \mid \mathfrak{M}\}$ for every $k$.) To prove the last statement of the theorem we need only prove that if $m < n$ then

$$\hat{E}\{x_n \mid x_{-\infty}, \cdots, x_m\} = x_m$$

with probability 1, that is, $x_n - x_m$ is orthogonal to $x_{-\infty}, \cdots, x_m$. By hypothesis $x_n - x_m$ is orthogonal to the random variables $\cdots, x_{m-1}, x_m$ and to the closed linear manifold $\mathfrak{M}_m$ they generate, which includes $x_{-\infty}$, and the proof is now complete.


According to Theorem 7.3 if the random variables $\cdots, x_{-1}, x_0, \cdots$ constitute a martingale (wide sense) they always can be written in the form

$$x_n = \sum_{j \le n} y_j$$

where the $y_j$'s are mutually orthogonal, $j \ge -\infty$, and the series converges in the mean. In fact we can take

$$y_{-\infty} = x_{-\infty}, \qquad y_j = x_j - x_{j-1}, \quad j > -\infty.$$

Conversely any sums of this form define a martingale (wide sense).




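THEOREM 7.4 Let $z$ be a random variable with a finite second moment, and let $\cdots \subset \mathfrak{M}_0 \subset \mathfrak{M}_1 \subset \cdots$ be closed linear manifolds of random variables. Let $\mathfrak{M}_{-\infty} = \bigcap_n \mathfrak{M}_n$ and let $\mathfrak{M}_\infty$ be the closed linear manifold generated by the random variables in $\bigcup_n \mathfrak{M}_n$. Then the random variables

$$\hat{E}\{z \mid \mathfrak{M}_{-\infty}\}, \cdots, \hat{E}\{z \mid \mathfrak{M}_0\}, \hat{E}\{z \mid \mathfrak{M}_1\}, \cdots, \hat{E}\{z \mid \mathfrak{M}_\infty\}$$

constitute a martingale (wide sense), and

$$\begin{aligned}
\mathrm{l.i.m.}_{n\to-\infty}\, \hat{E}\{z \mid \mathfrak{M}_n\} &= \hat{E}\{z \mid \mathfrak{M}_{-\infty}\} \\
\mathrm{l.i.m.}_{n\to\infty}\, \hat{E}\{z \mid \mathfrak{M}_n\} &= \hat{E}\{z \mid \mathfrak{M}_\infty\}
\end{aligned} \tag{7.6}$$

with probability 1.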


Define

$$x_n = \hat{E}\{z \mid \mathfrak{M}_n\}, \qquad -\infty \le n \le \infty.$$

Then $x_n$ is a random variable in the manifold $\mathfrak{M}_n$, and, if $m < n$, $x_n - x_m$ is orthogonal to $\mathfrak{M}_m$, and therefore to every $x_j$ with $j \le m$. In other words

$$x_m = \hat{E}\{x_n \mid \cdots, x_{m-1}, x_m\}$$

with probability 1, that is, the $x_n$'s constitute a martingale in the wide sense. By Theorem 7.3 the first limit in (7.6) exists. To show that the limit $x$ is the projection $x_{-\infty}$ we note that the projection $x_{-\infty}$ is characterized by two conditions: it is in $\mathfrak{M}_{-\infty}$ (as is $x$, since $x_n \in \mathfrak{M}_n$), and $z - x_{-\infty}$ is orthogonal to $\mathfrak{M}_{-\infty}$ (as is $z - x$ since $z - x_n$ is orthogonal to $\mathfrak{M}_n$ and therefore to $\mathfrak{M}_{-\infty}$). This completes the proof of the first line in (7.6). To prove the second line we note that, by Theorem 7.1,

$$E\{|x_n|^2\} \le E\{|x_\infty|^2\}$$

so that the mean limit in the second line exists, by Theorem 7.2. Just as in the preceding proof, the limit is then shown to have the properties characterizing the projection $x_\infty$.
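COROLLARY 1 Let $z, y_1, y_2, \cdots$ be any random variables with finite second moments. Then if $\mathfrak{N}_n$ is the closed linear manifold generated by the $y_j$'s, with $j \ge n$,

$$\begin{aligned}
\mathrm{l.i.m.}_{n\to\infty}\, \hat{E}\{z \mid y_n, y_{n+1}, \cdots\} &= \hat{E}\Big\{z \,\Big|\, \bigcap_n \mathfrak{N}_n\Big\} \\
\mathrm{l.i.m.}_{n\to\infty}\, \hat{E}\{z \mid y_1, \cdots, y_n\} &= \hat{E}\{z \mid y_1, y_2, \cdots\}
\end{aligned} \tag{7.6$'$}$$

with probability 1. In particular if $z$ is a random variable in the closed linear manifold generated by the $y_j$'s, the second limit can be replaced by $z$.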




To reduce the first of these equations to the first in (7.6) identify $\mathfrak{N}_n$ with $\mathfrak{M}_{-n}$. To reduce the second to the second in (7.6) identify with $\mathfrak{M}_n$ in (7.6) the closed linear manifold generated by $y_1, \cdots, y_n$.

The strict sense versions of Theorem 7.4 and its corollary are VII, 
Theorem 4.3, and its corollary. The study of projections that we have 
made in this section is really a study of one aspect of the geometry of 
Hilbert space. Since we have proved all the results to be used, we stop 
the study at this point. 

Equation (7.6') is of fundamental importance. As one application suppose that $z$ is actually in the closed linear manifold generated by the $y_j$'s. Then, according to (7.6'), if one approximates $z$ in the sense of least squares, in terms of linear combinations of $y_1, \cdots, y_n$, the (least squares) error approaches 0 when $n \to \infty$.

As an instructive example we apply (7.6') to a very special case. Suppose that the sequence $\{y_n'\}$ is an orthonormal sequence of random variables. We apply (7.6') with the following identifications:

$$z = y_1', \qquad y_n = y_1' + \cdots + y_n'.$$

Now the closed linear manifold $\mathfrak{N}_n$ generated by $y_n, y_{n+1}, \cdots$ is the same as that generated by $y_1' + \cdots + y_n', y_{n+1}', y_{n+2}', \cdots$. These random variables are mutually orthogonal, so any random variable in $\mathfrak{N}_n$ can be written in the form

$$a(y_1' + \cdots + y_n') + \sum_{j=n+1}^{\infty} a_j y_j', \qquad \sum_{j=n+1}^{\infty} |a_j|^2 < \infty,$$

where the series converges in the mean. If the random variable is in $\mathfrak{N} = \bigcap_1^\infty \mathfrak{N}_n$, it is in every $\mathfrak{N}_n$, so we must have $a_j = a$ for all $j$, and this is incompatible with the convergence of $\sum_j |a_j|^2$ unless the $a_j$'s vanish. In other words, $\mathfrak{N}$ contains only the null random variable. We therefore deduce from (7.6') that

$$\mathrm{l.i.m.}_{n\to\infty}\, \hat{E}\{y_1' \mid \mathfrak{N}_n\} = \mathrm{l.i.m.}_{n\to\infty}\, \hat{E}\{y_1' \mid y_n, y_{n+1}, \cdots\} = 0. \tag{7.7}$$


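Now

$$\hat{E}\{y_1' \mid \mathfrak{N}_n\} = \hat{E}\{y_1' \mid y_1' + \cdots + y_n',\ y_{n+1}',\ y_{n+2}', \cdots\} = \hat{E}\{y_1' \mid y_1' + \cdots + y_n'\}$$

because $y_{n+1}', y_{n+2}', \cdots$ are orthogonal to $y_1'$ and $y_1' + \cdots + y_n'$ and can therefore be ignored in the calculation. Finally by symmetry

$$\begin{aligned}
\hat{E}\{y_1' \mid y_1' + \cdots + y_n'\} &= \hat{E}\{y_j' \mid y_1' + \cdots + y_n'\}, \qquad j \le n, \\
&= \frac{1}{n}\sum_{j=1}^{n} \hat{E}\{y_j' \mid y_1' + \cdots + y_n'\} \\
&= \hat{E}\Big\{\frac{1}{n}\sum_{j=1}^{n} y_j' \,\Big|\, y_1' + \cdots + y_n'\Big\} = \frac{y_1' + \cdots + y_n'}{n}.
\end{aligned}$$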


This projection can, of course, also be evaluated by a simple evaluation of the appropriate Fourier coefficients. Combining this evaluation with (7.7), we find that

$$\mathrm{l.i.m.}_{n\to\infty}\, \frac{y_1' + \cdots + y_n'}{n} = 0,$$

a result that is hardly surprising since

$$E\Big\{\Big|\frac{y_1' + \cdots + y_n'}{n}\Big|^2\Big\} = \frac{1}{n}$$

and the limit equation simply states that this mean square expectation goes to 0 when $n \to \infty$. This rather ridiculous proof of a trivial case of the law of large numbers has been given for two reasons. In the first place it illustrates the application of the limit theorems proved in this section. In the second place the strong sense version of this theorem is an important and non-trivial theorem (III, Theorem 5.1), an important case of the strong law of large numbers for strictly stationary processes, which can be proved by giving the strong sense version of the above proof. (See VII §6.)
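The two evaluations used in this example, the projection of $y_1'$ on $y_1' + \cdots + y_n'$ and the mean square of the average, can be verified exactly on a finite model. The following is a minimal sketch under assumptions made here: 8 equally likely sample points and the seven non-constant Walsh functions as the orthonormal variables; it illustrates the finite-$n$ identities only, not the limit.

```python
from fractions import Fraction

K = 8  # assumed number of equally likely sample points

def expect(u, v):
    """E{u v} under the uniform distribution on K points."""
    return sum(Fraction(a * b, K) for a, b in zip(u, v))

def walsh(j):
    """Walsh function j on the points 0..K-1, with values +1 and -1."""
    return [(-1) ** bin(j & w).count("1") for w in range(K)]

# y'_1, ..., y'_7: an orthonormal family under the uniform distribution.
y = [walsh(j) for j in range(1, K)]

def partial_sum(n):
    """The random variable y'_1 + ... + y'_n."""
    return [sum(col) for col in zip(*y[:n])]

def projection_coefficient(n):
    """c such that the projection of y'_1 on the sum is c * (y'_1 + ... + y'_n)."""
    s = partial_sum(n)
    return expect(y[0], s) / expect(s, s)

def mean_square_of_average(n):
    """E{ |(y'_1 + ... + y'_n) / n|^2 }."""
    s = partial_sum(n)
    return expect(s, s) / n**2
```

Both functions return $1/n$ for each $n$ from 1 to 7, in agreement with the Fourier coefficient evaluation mentioned above.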


CHAPTER V


Markov Processes— 
Discrete Parameter 


1. Markov chains—definitions 


A Markov chain is defined as a Markov process (II §6) whose random variables can (with probability 1) only assume values in a certain finite or denumerably infinite set. This set is usually taken, for convenience, to be the integers $1, \cdots, N$ (finite case) or the integers $1, 2, \cdots$ (infinite case). We shall assume that these choices have been made. Physically one speaks of a system which evolves through numbered states in accordance with probabilistic laws satisfying the Markov hypothesis. Let the random variables of the chain be $x_1, x_2, \cdots$. If $P\{x_m(\omega) = i\} > 0$, define ${}_m p_{ij}$ by

$${}_m p_{ij} = P\{x_{m+1}(\omega) = j \mid x_m(\omega) = i\} = \frac{P\{x_m(\omega) = i,\ x_{m+1}(\omega) = j\}}{P\{x_m(\omega) = i\}}$$

and let ${}_mP$ be the matrix $[{}_m p_{ij}]$. Then, if $P\{x_m(\omega) = i\} > 0$,

$${}_m p_{ij} \ge 0, \qquad \sum_j {}_m p_{ij} = 1, \tag{1.1}$$

and

$$P\{x_{m+2}(\omega) = j \mid x_m(\omega) = i\} = \sum_k {}_m p_{ik}\; {}_{m+1}p_{kj}. \tag{1.2}$$


Since the process has the Markov property, the second factors in the sum do not depend on $i$. [In this equation it is understood that if ${}_{m+1}p_{kj}$ is undefined, because $P\{x_{m+1}(\omega) = k\} = 0$, the corresponding summand ${}_m p_{ik}\; {}_{m+1}p_{kj}$ is to be taken as 0; we observe that, for such a value of $k$, ${}_m p_{ik} = 0$.] Thus the matrix of probabilities of transitions from $i$ at time $m$ to $j$ at time $m + 2$, that is, from $i$ at time $m$ to $j$ in two steps, is the product ${}_mP\; {}_{m+1}P$ of the matrices of one-step transitions. In many applications ${}_mP$ is independent of $m$ so that we write $P$: $[p_{ij}]$ for ${}_mP$: $[{}_m p_{ij}]$ and speak of stationary transition probabilities. In this case, if $p_{ij}^{(n)}$ is the probability of a transition from $i$ to $j$ in $n$ steps,

$$[p_{ij}^{(n)}] = P^n$$

so that

$$p_{ij}^{(m+n)} = \sum_k p_{ik}^{(m)} p_{kj}^{(n)} \qquad (P^{m+n} = P^m P^n). \tag{1.3}$$
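Equation (1.3) is simply the statement that the $n$-step matrices are powers of $P$, which is easy to confirm numerically. The sketch below uses exact rational arithmetic; the 3-state matrix `P` and the helper names `matmul` and `matpow` are hypothetical choices made for illustration.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matpow(p, n):
    """n-step transition matrix P^n for n >= 1."""
    result = p
    for _ in range(n - 1):
        result = matmul(result, p)
    return result

# Hypothetical 3-state stochastic matrix, chosen only for illustration.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0),    F(1, 3), F(2, 3)]]

# Two-step transition probabilities p_ij^(2).
two_step = matpow(P, 2)
```

With these definitions `matpow(P, 5)` coincides with `matmul(matpow(P, 2), matpow(P, 3))`, and every power of `P` again has row sums equal to 1.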
k 


In practice matrices $[{}_m p_{ij}]$ are given, satisfying (1.1), and a Markov process is defined which has these matrices as transition probability matrices. This process is defined by choosing initial probabilities $P\{x_1(\omega) = i\}$ and defining

$$P\{x_1(\omega) = a_1, \cdots, x_n(\omega) = a_n\} = P\{x_1(\omega) = a_1\}\; {}_1p_{a_1a_2}\; {}_2p_{a_2a_3} \cdots {}_{n-1}p_{a_{n-1}a_n}.$$

With this definition of $x_1, x_2, \cdots$ probabilities, the $x_n$ process is actually a Markov process with the given transition probabilities. More precisely, if with this definition $P\{x_m(\omega) = i\} > 0$, then ${}_m p_{ij}$ is the transition probability

$$P\{x_{m+1}(\omega) = j \mid x_m(\omega) = i\}.$$

Whether $P\{x_m(\omega) = i\}$ is positive or not depends both on the initial probabilities and on the given transition probabilities.

According to II §6, any Markov process reversed in time is still a Markov process. In the case of a chain, the reverse one-step transition probabilities are easily calculated, and we find

$$P\{x_m(\omega) = j \mid x_{m+1}(\omega) = i\} = \frac{P\{x_m(\omega) = j\}\; {}_m p_{ji}}{P\{x_{m+1}(\omega) = i\}} = \frac{P\{x_m(\omega) = j\}\; {}_m p_{ji}}{\sum_k P\{x_m(\omega) = k\}\; {}_m p_{ki}}$$

where

$$P\{x_m(\omega) = j\} = \sum_{a_1, \cdots, a_{m-1}} P\{x_1(\omega) = a_1\}\; {}_1p_{a_1a_2}\; {}_2p_{a_2a_3} \cdots {}_{m-1}p_{a_{m-1}j}.$$

Note that the reverse transition probabilities depend on the choice of the initial probabilities.
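The dependence of the reverse transition probabilities on the initial probabilities can be seen in a two-state example. The following sketch assumes stationary transition probabilities and a positive probability for every state at time $m + 1$; the matrix `P` and the function names are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction as F

# Hypothetical stationary one-step matrix on two states (0-based indices).
P = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]

def step(dist, p):
    """Distribution one step later: sum over i of dist[i] * p[i][j]."""
    n = len(p)
    return [sum(dist[i] * p[i][j] for i in range(n)) for j in range(n)]

def reverse_matrix(dist, p):
    """r[i][j] = P{x_m = j | x_{m+1} = i} when x_m has distribution dist.

    Assumes every state has positive probability at time m + 1.
    """
    nxt = step(dist, p)
    n = len(p)
    return [[dist[j] * p[j][i] / nxt[i] for j in range(n)] for i in range(n)]

uniform_start = reverse_matrix([F(1, 2), F(1, 2)], P)
point_start = reverse_matrix([F(1), F(0)], P)
```

The two reverse matrices differ: starting from the uniform distribution the first row is $(2/3, 1/3)$, while starting in state 0 with probability 1 every reverse transition leads back to state 0.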
According to the previous discussion any matrix $[p_{ij}]$ satisfying the conditions

$$p_{ij} \ge 0, \qquad \sum_j p_{ij} = 1 \tag{1.4}$$

can be used, together with initial probabilities $\{p_i\}$ satisfying

$$p_i \ge 0, \qquad \sum_i p_i = 1, \tag{1.5}$$

to define a Markov chain with stationary transition probabilities. A matrix $[p_{ij}]$ satisfying (1.4) is called a stochastic matrix. The $n$-step transition probabilities, which also determine stochastic matrices, are defined successively by

$$p_{ij}^{(1)} = p_{ij}, \qquad p_{ij}^{(n+1)} = \sum_k p_{ik}^{(n)} p_{kj} = \sum_k p_{ik} p_{kj}^{(n)} \tag{1.6}$$

and satisfy (1.3). The probability $P\{x_n(\omega) = j\}$, the probability of state $j$ at time $n$, is given recursively by

$$p_j^{(1)} = p_j, \qquad p_j^{(n+1)} = \sum_i p_i^{(n)} p_{ij}, \quad n \ge 1. \tag{1.7}$$

The quantities $\{p_j^{(n)}\}$ are called the absolute probabilities of the process. In particular, if $p_j^{(1)} = p_j^{(2)} = p_j$ for all $j$, so that

$$p_j = \sum_i p_i p_{ij}, \tag{1.8}$$

which implies $p_j = p_j^{(3)} = \cdots$, the $p_j$'s are called stationary absolute probabilities. It will be shown below that in the finite dimensional case (that is, when there are finitely many states, so that the matrices involved are finite dimensional) there is always at least one set of stationary absolute probabilities. There may be none in the infinite dimensional case.
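The recursion (1.7) and the stationarity condition (1.8) can be illustrated with a short computation. This is a minimal sketch with exact rationals; the two-state matrix `P`, the vector `stationary`, and the function name are assumptions introduced for the example.

```python
from fractions import Fraction as F

# Hypothetical two-state stochastic matrix (0-based state indices).
P = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]

def absolute_probabilities(initial, p, n):
    """Absolute probabilities p^(n) from recursion (1.7), with p^(1) = initial."""
    dist = list(initial)
    for _ in range(n - 1):
        dist = [sum(dist[i] * p[i][j] for i in range(len(p)))
                for j in range(len(p))]
    return dist

# A solution of (1.8) for this P: p_j = sum over i of p_i * p_ij.
stationary = [F(1, 3), F(2, 3)]
```

Started from `stationary`, the recursion returns the same vector at every time $n$; started from the point distribution `[1, 0]` it returns `[1/2, 1/2]` at $n = 2$.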


2. Finite dimensional Markov chains with stationary transition
probabilities 


In this section we shall consider finite dimensional chains with $N < \infty$ states. The transition probabilities will be stationary, and the problem to be discussed is, in the notation of §1, the determination of the asymptotic character of $p_{ij}^{(n)}$ for $n \to \infty$, that is, the determination of the characteristic properties of the system under examination after a long lapse of time. In particular, an analysis of the dependence of the long-time characteristics of the system on the initial states is desired.

The problem will be considered in various special cases leading up to 
the general case (d), and the results will then be specialized to further 
special cases. 

Case (a) $p_{ij} = p_j$ is independent of $i$.

The condition imposed means that the random variables $x_1, x_2, \cdots$ of a Markov chain with these transition probabilities are mutually independent, that is, the system assumes its successive states as a result of independent trials of some experiment. For example, the system may be a die, with

$$p_{ij} = \tfrac{1}{6}, \qquad i, j = 1, \cdots, 6.$$

The probability of a 4 on the $(m + 1)$th toss is $\tfrac{1}{6} = p_4$, regardless of the results of previous tosses. In Case (a) it is intuitively clear, and trivially verifiable, that

$$p_{ij}^{(n)} = p_{ij} = p_j, \qquad n = 1, 2, \cdots,$$

and the problem of this section is thus trivial. Moreover, no matter how $\{p_i^{(1)}\}$ is chosen,

$$p_j^{(n)} = \sum_{i=1}^{N} p_i^{(n-1)} p_{ij} = p_j \sum_{i=1}^{N} p_i^{(n-1)} = p_j, \qquad n > 1.$$

The process is entirely unaffected by the initial conditions after the initial step has been taken.
Case (b) There is an integer $\nu \ge 1$ and a set $J$ of $N_1 \ge 1$ values of $j$ such that

$$\min_{\substack{1 \le i \le N \\ j \in J}} p_{ij}^{(\nu)} = \delta > 0.$$

In this case there are numbers $p_1, \cdots, p_N$ such that

$$\lim_{n\to\infty} p_{ij}^{(n)} = p_j, \quad i, j = 1, \cdots, N, \qquad p_j \ge \delta, \quad j \in J; \tag{2.1}$$

$p_1, \cdots, p_N$ form a set of stationary absolute probabilities. Moreover,

$$|p_{ij}^{(n)} - p_j| \le (1 - N_1\delta)^{(n/\nu) - 1}. \tag{2.2}$$

Thus in Case (b) the long-run properties of the transitions make the process like that of Case (a) in the sense that no matter what the initial distribution there is probability approximately $p_j$ that the system will be in state $j$ after a large number of transitions.

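Proof Define $m_j^{(n)}$, $M_j^{(n)}$ by

$$m_j^{(n)} = \min_i p_{ij}^{(n)}, \qquad M_j^{(n)} = \max_i p_{ij}^{(n)}.$$

Then

$$\begin{aligned}
m_j^{(n+1)} &= \min_i \sum_k p_{ik} p_{kj}^{(n)} \ge \min_i \sum_k p_{ik} m_j^{(n)} = m_j^{(n)} \\
M_j^{(n+1)} &= \max_i \sum_k p_{ik} p_{kj}^{(n)} \le \max_i \sum_k p_{ik} M_j^{(n)} = M_j^{(n)}
\end{aligned} \tag{2.3}$$

so that

$$m_j^{(1)} \le m_j^{(2)} \le \cdots \le M_j^{(2)} \le M_j^{(1)}. \tag{2.4}$$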




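If, for fixed $\alpha$, $\beta$, $\sum_k^{+}$ and $\sum_k^{-}$ denote summation over values of $k$ for which $p_{\alpha k}^{(\nu)} \ge p_{\beta k}^{(\nu)}$ and for which $p_{\alpha k}^{(\nu)} < p_{\beta k}^{(\nu)}$ respectively, then

$$\sum_k{}^{+} (p_{\alpha k}^{(\nu)} - p_{\beta k}^{(\nu)}) + \sum_k{}^{-} (p_{\alpha k}^{(\nu)} - p_{\beta k}^{(\nu)}) = 1 - 1 = 0, \tag{2.5}$$

and, if there are $s$ values of $k \in J$ involved in $\sum^{+}$,

$$\sum_k{}^{+} (p_{\alpha k}^{(\nu)} - p_{\beta k}^{(\nu)}) \le 1 - (N_1 - s)\delta - s\delta = 1 - N_1\delta. \tag{2.6}$$

Using these two facts, we find that, if $n \ge 1$,

$$\begin{aligned}
M_j^{(n+\nu)} - m_j^{(n+\nu)}
&= \max_{\alpha,\beta} \sum_k (p_{\alpha k}^{(\nu)} - p_{\beta k}^{(\nu)})\, p_{kj}^{(n)} \\
&\le \max_{\alpha,\beta} \Big[\sum_k{}^{+} (p_{\alpha k}^{(\nu)} - p_{\beta k}^{(\nu)})\, M_j^{(n)} + \sum_k{}^{-} (p_{\alpha k}^{(\nu)} - p_{\beta k}^{(\nu)})\, m_j^{(n)}\Big] \\
&= \max_{\alpha,\beta} \sum_k{}^{+} (p_{\alpha k}^{(\nu)} - p_{\beta k}^{(\nu)})\,(M_j^{(n)} - m_j^{(n)}) \\
&\le (1 - N_1\delta)(M_j^{(n)} - m_j^{(n)}).
\end{aligned} \tag{2.7}$$

Moreover,

$$M_j^{(\nu)} - m_j^{(\nu)} = \max_{\alpha,\beta}\, (p_{\alpha j}^{(\nu)} - p_{\beta j}^{(\nu)}) \le \max_{\alpha,\beta} \sum_k{}^{+} (p_{\alpha k}^{(\nu)} - p_{\beta k}^{(\nu)}) \le 1 - N_1\delta. \tag{2.8}$$

Therefore, combining the two preceding inequalities, we find that

$$M_j^{(k\nu)} - m_j^{(k\nu)} \le (1 - N_1\delta)^k. \tag{2.9}$$

Then [cf. (2.4)] $M_j^{(n)}$ and $m_j^{(n)}$ must have a common limit $p_j$ when $n \to \infty$, and

$$|p_{ij}^{(n)} - p_j| \le M_j^{(n)} - m_j^{(n)} \le (1 - N_1\delta)^{(n/\nu) - 1}. \tag{2.10}$$

If $j \in J$, $0 < \delta \le m_j^{(\nu)} \le p_j$. When $n \to \infty$ in (1.6), the resulting equation (taken together with the fact that the $p_j$'s are non-negative and have sum 1) is precisely the condition that $p_1, \cdots, p_N$ be a set of stationary absolute probabilities. This concludes the discussion of Case (b).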
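The geometric contraction of the column spreads $M_j^{(n)} - m_j^{(n)}$ obtained in this proof can be observed directly. The sketch below is an illustration under assumptions made here: a hypothetical 3-state matrix whose middle column has minimum $1/2$, so that $\nu = 1$, $N_1 = 1$, $\delta = 1/2$, and a bound with exponent equal to the integer part of $n/\nu$, which is what (2.4) and (2.9) give directly and which is at least as sharp as the exponent $(n/\nu) - 1$.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical chain: column 1 (0-based) has minimum entry 1/2.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0),    F(1, 2), F(1, 2)]]
NU, N1, DELTA = 1, 1, F(1, 2)

def spreads(p, n_max):
    """spreads(p, n_max)[n-1][j] = M_j^(n) - m_j^(n) for n = 1..n_max."""
    out, power = [], p
    for _ in range(n_max):
        out.append([max(col) - min(col) for col in zip(*power)])
        power = matmul(power, p)
    return out

def bound(n):
    """(1 - N1*delta)^k with k the integer part of n / nu."""
    return (1 - N1 * DELTA) ** (n // NU)
```

For this particular matrix the spread of the first column is exactly $2^{-n}$, so the bound is attained; in general only the inequality should be expected.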

There are many methods which can be used to carry the study of 
Markov chains further. We shall use one based on an analysis of the 
actual transitions. Before doing this we shall, however, give an example 
of the possibilities of a purely analytic method. The theorem we prove 
in this way will appear again as a consequence of more detailed results. 




THEOREM 2.1 If $P$: $[p_{ij}]$ is a stochastic matrix, there is a stochastic matrix $Q$: $[q_{ij}]$ such that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} p_{ij}^{(m)} = q_{ij}, \qquad i, j = 1, \cdots, N. \tag{2.11}$$

Moreover, $QP = PQ = Q$ and $Q^2 = Q$.

Note that this is true with no restrictions on the stochastic matrix $[p_{ij}]$. This theorem states that there is a long-time average probability that a system starting in state $i$ will be in state $j$. In the following, if $A_n$: $[a_{ij}(n)]$ is a matrix for each $n$, $A_n \to A$: $[a_{ij}]$ means $a_{ij}(n) \to a_{ij}$ ($n \to \infty$) for all $i, j$. If $A$: $[a_{ij}]$ and $B$: $[b_{ij}]$ are matrices $A + B$ is the matrix $[a_{ij} + b_{ij}]$, and if $\lambda$ is a constant $\lambda A$ is the matrix $[\lambda a_{ij}]$. Then (2.11) becomes

$$\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} P^m = Q. \tag{2.11$'$}$$

Now the matrices

$$\frac{1}{n}\sum_{m=1}^{n} P^m, \qquad n = 1, 2, \cdots$$

are stochastic matrices. Since the elements of stochastic matrices are bounded by 1, every sequence of stochastic matrices contains a convergent subsequence. Hence to prove the existence of the limit $Q$ in (2.11') it is sufficient to show that the convergent subsequences of the averages on the left all have the same limit. Suppose then that $A$ is the limit of some convergent subsequence, that is, for some sequence of integers $n_1 < n_2 < \cdots$,

$$\lim_{j\to\infty} \frac{1}{n_j}\sum_{m=1}^{n_j} P^m = A.$$

Multiplying by $P$ on the left and right, we find that

$$\lim_{j\to\infty} \frac{1}{n_j}\sum_{m=2}^{n_j+1} P^m = AP = PA.$$

Now the two averages in the preceding equations have the same limit when $j \to \infty$ since these averages differ by the two terms

$$\frac{P^{n_j+1}}{n_j}, \qquad \frac{P}{n_j}$$

which go to zero when $j \to \infty$. Thus

$$A = AP = PA.$$




This implies that, for every $n$,

$$A = AP^n = P^nA = A\Big(\frac{1}{n}\sum_{m=1}^{n} P^m\Big) = \Big(\frac{1}{n}\sum_{m=1}^{n} P^m\Big)A.$$

Hence, if $B$ is any limit matrix of a subsequence of the averages in (2.11'),

$$A = AB = BA.$$

Since $A$ and $B$ are arbitrary limit matrices here, they can be interchanged, giving

$$B = BA = AB.$$

Then $A = B$, that is, there is only one limit matrix, as was to be proved. Finally, according to the preceding equations, $Q = QP = PQ$ and the limit $Q$ is idempotent, $Q^2 = Q$. The study of idempotent stochastic matrices leads to more details on the asymptotic character of $p_{ij}^{(n)}$ for large $n$. These details will be obtained by other methods below.
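The need for averages in Theorem 2.1 already appears for the two-state chain that changes state at every step: the powers of $P$ alternate and have no limit, while their averages converge. The following minimal sketch uses exact rationals; the matrix `P`, the candidate limit `Q`, and the helper names are assumptions made for the illustration.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def cesaro_average(p, n):
    """(1/n) * (P + P^2 + ... + P^n)."""
    size = len(p)
    total = [[F(0)] * size for _ in range(size)]
    power = p
    for _ in range(n):
        total = [[t + x for t, x in zip(trow, prow)]
                 for trow, prow in zip(total, power)]
        power = matmul(power, p)
    return [[t / n for t in row] for row in total]

# Hypothetical periodic chain: the state changes at every transition.
P = [[F(0), F(1)],
     [F(1), F(0)]]
Q = [[F(1, 2), F(1, 2)],
     [F(1, 2), F(1, 2)]]
```

For even $n$ the average equals `Q` exactly, for odd $n$ it differs from `Q` by $1/(2n)$ in each entry, and `Q` satisfies $QP = PQ = Q$ and $Q^2 = Q$.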

We shall use the following lemma. 

LEMMA 2.2 Let $S$ be a set of positive integers, with highest common factor $d$. Then, if $m + n \in S$ whenever both $m \in S$ and $n \in S$, all sufficiently large multiples of $d$ are in $S$.

To prove this we note first that there are finitely many integers $m_1, \cdots, m_k$ in $S$ whose highest common factor is $d$. Then there are integers $n_1, \cdots, n_k$ such that

$$n_1 m_1 + \cdots + n_k m_k = d.$$

This equation states (putting the terms with negative $n_j$'s on the right) that there are two integers in $S$ of the form $p$, $p + d$. Then for every positive integer $m$ the integers

$$mp,\ mp + d,\ \cdots,\ mp + md$$

are in $S$, that is, if $p_1 = p/d$, the integers

$$mp + nd = d(mp_1 + n), \qquad 0 \le n \le m$$

are in $S$. These integers obviously include all multiples of $d$ larger than $p_1^2 d$.
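The lemma can be explored by brute force for sets $S$ generated additively from a few integers. The sketch below is an assumption-laden illustration: it takes $S$ to be the set of all sums of one or more generators, searches only up to a chosen `limit`, and reports the largest multiple of $d$ below that limit which is missing from $S$.

```python
from functools import reduce
from math import gcd

def additive_closure(generators, limit):
    """All sums of one or more generators (repetition allowed) not exceeding limit."""
    reachable = set()
    for total in range(1, limit + 1):
        if total in generators or any(total - g in reachable for g in generators):
            reachable.add(total)
    return reachable

def last_missing_multiple(generators, limit):
    """Return (d, largest multiple of d up to limit that is not in S), or (d, 0)."""
    d = reduce(gcd, generators)
    s = additive_closure(generators, limit)
    missing = [m for m in range(d, limit + 1, d) if m not in s]
    return d, (max(missing) if missing else 0)
```

For the generators $\{4, 6\}$ the highest common factor is 2 and only the multiple 2 itself is missing; for $\{6, 10, 15\}$ the highest common factor is 1 and every integer above 29 is in $S$.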

In the following $j$ will be called a consequent of $i$ (of order $n$), relative to a given stochastic matrix, if $p_{ij}^{(n)} > 0$. This means, if $n = 1$, that $p_{ij} > 0$, or, if $n > 1$, that there are intermediate states $\alpha_1, \cdots, \alpha_{n-1}$ with $p_{i\alpha_1} \cdots p_{\alpha_{n-1}j} > 0$. If $j$ is a consequent of $i$, it may be a consequent of several orders, but there is always a smallest order $\le N$, where $N$ is the dimension of the matrix. In fact, if two $\alpha_j$'s are equal (and here we take $i = \alpha_0$, $j = \alpha_n$), say $\alpha_r = \alpha_s$, the states $\alpha_{r+1}, \cdots, \alpha_s$ can be dropped from the succession; dropping all possible intermediate states in this way, a succession of distinct $\alpha_j$'s is finally obtained (except that $\alpha_0 = \alpha_n$ is possible) with $n \le N$.

Case (c) For every $i$ and $j$, $j$ is a consequent of $i$.

Let $d$ be the highest common factor of the orders in which 1 is a consequent of itself. Since

$$p_{11}^{(m+n)} \ge p_{11}^{(m)} p_{11}^{(n)}$$

these orders include the number $m + n$ if $m$ and $n$ are included. Hence by the lemma these orders include all large multiples of $d$. Now

$$p_{11}^{(l+m+n)} \ge p_{1i}^{(l)} p_{ij}^{(m)} p_{j1}^{(n)}$$

and $l$ and $n$ can be chosen to make the first and last factors on the right positive. Then this inequality shows that $p_{ij}^{(m)} = 0$ except possibly for values of $m$ differing by certain multiples of $d$. Then if $C_\alpha$ ($\alpha = 1, \cdots, d$) is defined as the set of integers which are consequents of 1 of order $nd + \alpha$ for some $n$, the $C_\alpha$'s are disjunct and exhaust $1, \cdots, N$. Evidently $C_{\alpha+1}$ consists of the integers which are consequents of those in $C_\alpha$ of order $nd + 1$ for some $n$ (interpreting $C_{d+1}$ as $C_1$), and so on. Hence, if the states are renumbered so that those in $C_1$ come first, then those in $C_2, \cdots$, the matrix $[p_{ij}]$ and its successive powers take the indicated forms ($d = 3$), running through these forms cyclically.

$$[p_{ij}]:\ \begin{array}{c|ccc} & C_1 & C_2 & C_3 \\ \hline C_1 & 0 & P_1 & 0 \\ C_2 & 0 & 0 & P_2 \\ C_3 & P_3 & 0 & 0 \end{array}
\qquad
[p_{ij}^{(2)}]:\ \begin{array}{c|ccc} & C_1 & C_2 & C_3 \\ \hline C_1 & 0 & 0 & P_1P_2 \\ C_2 & P_2P_3 & 0 & 0 \\ C_3 & 0 & P_3P_1 & 0 \end{array}$$

$$[p_{ij}^{(3)}]:\ \begin{array}{c|ccc} & C_1 & C_2 & C_3 \\ \hline C_1 & P_1P_2P_3 & 0 & 0 \\ C_2 & 0 & P_2P_3P_1 & 0 \\ C_3 & 0 & 0 & P_3P_1P_2 \end{array}$$

CYCLICALLY MOVING CLASSES

Only the elements in a submatrix $P_\alpha$ can be positive. The system runs cyclically through the sets of states $C_1, C_2, \cdots, C_d, C_1, \cdots$. Then $[p_{ij}^{(d)}]$ is a matrix whose elements vanish except for those in $d$ square submatrices down the main diagonal.




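We have seen that there is a $\nu$ so large that $p_{11}^{(md)} > 0$ if $m \ge \nu$. For each $i$, $j$

$$p_{ij}^{(l+md+n)} \ge p_{i1}^{(l)} p_{11}^{(md)} p_{1j}^{(n)}$$

and we can choose $l$ and $n$ so that the first and third factors are positive, and so that $l \le N$, $n \le N$. The second factor is positive if $m \ge \nu$. If $i$, $j$ are in the same $C_\alpha$, $d$ will be a factor of $l + n$. Hence

$$p_{ij}^{(rd)} > 0 \quad \text{if } r \ge \nu + \frac{2N}{d}, \qquad i, j \in C_\alpha.$$

The matrix $[p_{ij}^{(d)}]$ with $i, j \in C_\alpha$ is thus a stochastic matrix satisfying the hypotheses of Case (b) above with $J$ the class of all integers in $C_\alpha$. Therefore

$$\lim_{n\to\infty} p_{ij}^{(nd)} = \pi_j > 0, \quad i, j \in C_\alpha, \qquad \sum_{j \in C_\alpha} \pi_j = 1.$$

More generally, if $i \in C_\alpha$, and if $j \in C_\beta$

$$\begin{aligned}
\lim_{n\to\infty} p_{ij}^{(nd+m)} = \lim_{n\to\infty} \sum_{k \in C_\beta} p_{ik}^{(m)} p_{kj}^{(nd)} = \sum_{k \in C_\beta} p_{ik}^{(m)} \pi_j
&= \pi_j, \qquad \beta \equiv \alpha + m \pmod d, \\
&= 0 \quad \text{otherwise.}
\end{aligned} \tag{2.12}$$

This implies that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} p_{ij}^{(m)} = \frac{\pi_j}{d}, \tag{2.13}$$

which is independent of $i$. Thus, regardless of the initial state $i$, there is asymptotically an average probability $\pi_j/d$ of being in state $j$ after a large number of transitions; more exactly, there is probability 0 or nearly $\pi_j$ of being in state $j$ after a large number $n$ of transitions, depending on whether (if $i \in C_\alpha$, $j \in C_\beta$) $\alpha + n \equiv \beta \pmod d$ or not. For large $n$ the non-zero boxes of $[p_{ij}^{(n)}]$ contain only positive elements.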
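The period $d$ and the cyclically moving classes of Case (c) can be computed from the pattern of positive entries alone. The sketch below swaps in a standard graph technique in place of the definition by orders of return: breadth-first distances from one state, followed by a highest common factor over all positive transitions. It assumes every state is a consequent of every other, uses 0-based state labels, and numbers the classes so that class 0 contains state 0, which differs from the labeling of the $C_\alpha$ in the text by a cyclic shift.

```python
from functools import reduce
from math import gcd

def cyclic_classes(p):
    """Period d and cyclically moving classes of a chain satisfying Case (c)."""
    n = len(p)
    # Breadth-first distances from state 0 along positive transitions.
    dist = {0: 0}
    frontier = [0]
    while frontier:
        nxt = []
        for i in frontier:
            for j in range(n):
                if p[i][j] > 0 and j not in dist:
                    dist[j] = dist[i] + 1
                    nxt.append(j)
        frontier = nxt
    # d is the highest common factor of dist[i] + 1 - dist[j] over transitions i -> j.
    d = reduce(gcd, (dist[i] + 1 - dist[j]
                     for i in range(n) for j in range(n) if p[i][j] > 0), 0)
    classes = [[j for j in range(n) if dist[j] % d == a] for a in range(d)]
    return d, classes

# Hypothetical chain with d = 3: {0} -> {1, 2} -> {3} -> {0}.
P3 = [[0.0, 0.5, 0.5, 0.0],
      [0.0, 0.0, 0.0, 1.0],
      [0.0, 0.0, 0.0, 1.0],
      [1.0, 0.0, 0.0, 0.0]]
period, classes = cyclic_classes(P3)
```

For `P3` the result is period 3 with classes `[0]`, `[1, 2]`, `[3]`; a chain with some $p_{ii} > 0$ gives period 1 and a single class.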

Case (d) General case The integers $1, \cdots, N$ can be divided into two classes as follows: a transient integer is one which has a consequent of which it is not itself a consequent; a non-transient integer is one which is a consequent of every one of its consequents. If $j$ is a consequent of $i$, a non-transient integer, $j$ is itself non-transient. In fact, if $k$ is a consequent of $j$ it is also one of $i$, $i \to j \to k$, and hence $i$ must be a consequent of $k$ ($i$ is non-transient), $i \to j \to k \to i$. Then $j$ is a consequent of $k$, $k \to i \to j$, as was to be proved. In the following $F$ will be the class of transient integers. The non-transient integers are further divided into ergodic classes, $E_1, E_2, \cdots$, by putting two integers in the same class if they are consequents of each other. Then, if $i \in E_a$, $p_{ij} = 0$ for $j \notin E_a$.

$$[p_{ij}]:\ \begin{array}{c|cccc} & E_1 & E_2 & E_3 & F \\ \hline E_1 & R_{11} & 0 & 0 & 0 \\ E_2 & 0 & R_{22} & 0 & 0 \\ E_3 & 0 & 0 & R_{33} & 0 \\ F & R_{41} & R_{42} & R_{43} & R_{44} \end{array}$$

ERGODIC CLASSES

If the states are numbered so that those in $E_1$ are first, then those in $E_2, \cdots$ and finally those in $F$, the matrix $[p_{ij}]$ takes the indicated form (three ergodic classes are shown). Only the elements in a submatrix $R_{ij}$ can be positive. The powers of $[p_{ij}]$ have this same form. Now suppose that $i \in F$. We show that at least one of its consequents lies in an ergodic class, that is, at least one of its consequents is non-transient. Let $\Gamma_i$ be the set of consequents of $i$. Then $\Gamma_i$ certainly includes the consequents of all the integers in it, that is, if $j \in \Gamma_i$ then $\Gamma_j \subset \Gamma_i$; let $\Gamma_{j_0}$ be the $\Gamma_j$, $j \in \Gamma_i$, with the smallest number of elements. Because of this minimal property, $\Gamma_{j_0}$ must be the class of consequents of each of its own members. Then $\Gamma_{j_0}$ consists of non-transient integers, and in fact is an ergodic class; the integers of this ergodic class are all consequents of $i$. This means that in the first place there must actually be non-transient integers and in the second place each of the $F$ rows of $[p_{ij}^{(n)}]$ must, for some $n$, contain a positive element in an $E_a$ column. For each $a$, $E_a$ determines its own stochastic matrix, the $a$th submatrix down the main diagonal of $[p_{ij}]$, and the powers of this matrix are the corresponding submatrices in the powers of $[p_{ij}]$. Moreover, each of these stochastic matrices comes under Case (c). Thus we know the character of $p_{ij}^{(n)}$ for large $n$ if $i \in E_a$. If $d_a$ is the number of cyclically moving classes ${}_aC_1, \cdots, {}_aC_{d_a}$ in $E_a$

$$\begin{aligned}
\lim_{n\to\infty} p_{ij}^{(nd_a+m)} &= {}_a\pi_j > 0, \qquad \beta \equiv \alpha + m \pmod{d_a}, \quad i \in {}_aC_\alpha,\ j \in {}_aC_\beta, \\
&= 0 \quad \text{otherwise,}
\end{aligned} \tag{2.12$'$}$$

where

$$\sum_{j \in {}_aC_\beta} {}_a\pi_j = 1.$$

Then

$$\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} p_{ij}^{(m)} = \frac{{}_a\pi_j}{d_a}, \qquad i \in E_a,\ j \in {}_aC_\beta. \tag{2.13$'$}$$
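The division of the states into the transient class $F$ and the ergodic classes $E_1, E_2, \cdots$ depends only on which transitions are possible, so it can be carried out by a reachability computation. The following is a minimal sketch under assumptions made here: 0-based state labels, a simple fixed-point iteration for the sets of consequents, and a hypothetical 4-state matrix.

```python
def consequents(p):
    """reach[i] = set of states j that are consequents of i (of some order >= 1)."""
    n = len(p)
    reach = [{j for j in range(n) if p[i][j] > 0} for i in range(n)]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            new = set().union(*(reach[j] for j in reach[i])) - reach[i]
            if new:
                reach[i] |= new
                changed = True
    return reach

def classify(p):
    """Return (list of transient states, list of ergodic classes)."""
    reach = consequents(p)
    n = len(p)
    # Transient: some consequent j of i does not have i as a consequent.
    transient = [i for i in range(n) if any(i not in reach[j] for j in reach[i])]
    ergodic = []
    for i in range(n):
        if i not in transient and not any(i in c for c in ergodic):
            ergodic.append(sorted(reach[i]))
    return transient, ergodic

# Hypothetical chain: {0, 1} and {2} are ergodic classes, state 3 is transient.
P = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.25, 0.0, 0.25, 0.5]]
F_states, E_classes = classify(P)
```

For a non-transient state the set of its consequents is exactly its ergodic class, which is why `classify` can read the classes off `reach` directly.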
We have seen that the convergence is exponentially fast in (2.12'). If $i \in F$, $\sum_{j \in F} p_{ij}^{(n)}$ cannot increase when $n$ increases; it is the probability of going from $i$ into $F$ in $n$ transitions, that is, of remaining in $F$ throughout the transitions, since once out of $F$ the system stays out. Moreover, we have seen that the sum is $< 1$ for large $n$. Now choose $\mu$ so large that all these row sums are $< 1$ for $n \ge \mu$,

$$\sum_{j \in F} p_{ij}^{(\mu)} \le \eta < 1, \qquad i \in F.$$

Then

$$\sum_{j \in F} p_{ij}^{(n\mu+\mu)} = \sum_{j, k \in F} p_{ik}^{(\mu)} p_{kj}^{(n\mu)} \le \eta \max_{k \in F} \sum_{j \in F} p_{kj}^{(n\mu)}, \qquad i \in F,$$

so that

$$\sum_{j \in F} p_{ij}^{(n\mu)} \le \eta^n, \qquad i \in F.$$

Since the sum $\sum_{j \in F} p_{ij}^{(n)}$ is monotone in $n$,

$$\sum_{j \in F} p_{ij}^{(n)} \le \eta^{(n/\mu) - 1} \to 0 \quad (n \to \infty), \qquad i \in F.$$

Now the sum on the left is the probability of being in $F$ after $n$ transitions, and the sum of this over $n$ converges. Hence we have the following theorem:

THEOREM 2.3 The system will remain among the transient states through only a finite number of transitions, with probability 1.

The system thus goes into the non-transient states in the long run, even if it is initially in a transient state. Let $p_i(E_a)$ be the probability that the system will finally be in $E_a$, if it was initially in state $i$,

$$p_i(E_a) = \lim_{n\to\infty} \sum_{j \in E_a} p_{ij}^{(n)}.$$

(The sum is monotone non-decreasing in $n$, since once in $E_a$ the system stays there, so that the $n$th sum is the probability of entry into $E_a$ at some time up to the $n$th transition.) If $i \in E_a$, $p_i(E_a) = 1$; if $i \in E_b$, $b \neq a$, $p_i(E_a) = 0$. Now if $j \in E_a$,

$$p_{ij}^{(m+n)} = \sum_{k \in E_a} p_{ik}^{(m)} p_{kj}^{(n)} + \sum_{k \in F} p_{ik}^{(m)} p_{kj}^{(n)}.$$

The last sum is $\le \sum_{k \in F} p_{ik}^{(m)}$, and we have seen that this goes to 0 when $m \to \infty$. Then, if $m, n \to \infty$ in this equation and if $d_a = 1$, we find that

$$\lim_{n\to\infty} p_{ij}^{(n)} = p_i(E_a)\; {}_a\pi_j, \qquad j \in E_a.$$




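More generally, if $d_a > 1$ let $p_i({}_aC_\beta)$ be the probability that the system will finally be in ${}_aC_\beta$, if it was initially in state $i$, in the course of transitions of orders $d_a, 2d_a, \cdots$,

$$p_i({}_aC_\beta) = \lim_{n\to\infty} \sum_{j \in {}_aC_\beta} p_{ij}^{(nd_a)}.$$

Then

$$p_i(E_a) = \sum_\beta p_i({}_aC_\beta).$$

If $j \in {}_aC_\beta$

$$p_{ij}^{(nd_a + m_1d_a + m)} = \sum_{k \in {}_aC_{\beta-m}} p_{ik}^{(nd_a)} p_{kj}^{(m_1d_a+m)} + \sum_{k \in F} p_{ik}^{(nd_a)} p_{kj}^{(m_1d_a+m)},$$

where $\beta - m$ is to be interpreted mod $d_a$. When $n, m_1 \to \infty$ in this equation we find that

$$\lim_{n\to\infty} p_{ij}^{(nd_a+m)} = p_i({}_aC_{\beta-m})\; {}_a\pi_j, \qquad j \in {}_aC_\beta. \tag{2.12$''$}$$

[Equation (2.12') is the special case with $i \notin F$.] Then

$$\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} p_{ij}^{(m)} = \frac{\sum_\alpha p_i({}_aC_\alpha)\; {}_a\pi_j}{d_a} = \frac{p_i(E_a)\; {}_a\pi_j}{d_a}, \qquad j \in {}_aC_\beta. \tag{2.13$''$}$$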

We have thus proved Theorem 2.1 once again, in a proof giving insight into the transitions of the process. A detailed nomenclature has been used to distinguish between the various important types of stochastic matrices which we shall now discuss, but this nomenclature will not be given here. It is doubtful that the nomenclature would have been invented if the full results and their connotation in ergodic theory had been known earlier.

The above diagram for $[p_{ij}]$ exhibiting the effect of the division into ergodic classes is also valid for $[q_{ij}]$, the limit in (2.11) and (2.13''); however, the $F \times F$ box is now also a zero box. Each $E_a \times E_a$ box has identical rows,

$$q_{ij} = \frac{{}_a\pi_j}{d_a} > 0, \qquad i \in E_a,\ j \in {}_aC_\beta.$$

The part of any $F$ row below this box is proportional to the rows of this box, with constant of proportionality $p_i(E_a) \ge 0$.
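The structure of the limit matrix $[q_{ij}]$ can be seen in a small chain with two absorbing states and two transient states. The sketch below approximates the limit by a high power of $P$ in floating point; the chain (a symmetric random walk on four states with absorbing ends), the power 200, and the helper names are assumptions chosen for the illustration. Each ergodic class here is a single state, so $d_a = 1$ and ${}_a\pi_j = 1$.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matpow(p, n):
    """P^n for n >= 1 by repeated multiplication."""
    result = p
    for _ in range(n - 1):
        result = matmul(result, p)
    return result

# Hypothetical chain: states 0 and 3 absorbing, states 1 and 2 transient.
P = [[1.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5],
     [0.0, 0.0, 0.0, 1.0]]

# Numerical stand-in for the limit matrix [q_ij].
limit = matpow(P, 200)
```

In `limit` the columns of the transient states 1 and 2 are numerically zero, and the rows of the transient states carry the entry probabilities: from state 1 the system ends in state 0 with probability $2/3$ and in state 3 with probability $1/3$.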

Case (e) The limit $q_{ij}$ of Theorem 2.1 is independent of $i$ if and only if there is only a single ergodic class. This case is usually described by the statement that the system's state becomes independent of the initial conditions as the number of transitions becomes infinite.




Case (f) The limit q,; of Theorem 2.1 can be taken as an ordinary limit 


lim pi” = qa 

n>a 
if and only if there are no cyclically transferring subclasses in any ergodic 
class. 

Case (g) The limit $q_{ij}$ of Theorem 2.1 is positive for all $i, j$ if and only
if there are no transient states and only one ergodic class. The elements
$q_{ij}$ are then independent of $i$.

Case (b) above, in which, for some $\nu$, $p_{ij}^{(\nu)} > 0$ whenever $j \in J$, can now
be characterized as follows: there are no transient states in $J$; there is
only one ergodic class, and this class contains no cyclically moving
subclasses.

Case (h) If $p_{ii} > 0$ for all $i$, the same will be true no matter how the
states are renumbered. Then under this condition there can be no
cyclically moving classes of states, so that by (f)

$$\lim_{n \to \infty} p_{ij}^{(n)} = q_{ij}$$

for all $i, j$. Thus in this case the Cesàro limit in Theorem 2.1 can be
replaced by an ordinary limit. The condition used is unnecessarily
strong, however.

Case (i) If $p_{ij} = p_{ji}$ the matrix $[p_{ij}]$, all its powers, and therefore also
the limit matrix $[q_{ij}]$ are symmetric. Moreover, this is true no matter
how the states are renumbered. There can, therefore, be no cyclically
moving classes of states, and no transient states (both of which cases
require unsymmetric matrices). Hence we have

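$$\lim_{n \to \infty} p_{ij}^{(n)} = {}_\alpha p_j = q_{ij}, \qquad i \in E_\alpha.$$

Since $[q_{ij}]$ is symmetric, this becomes

$$\lim_{n \to \infty} p_{ij}^{(n)} = \frac{1}{N_\alpha}, \qquad i, j \in E_\alpha,$$

where $N_\alpha$ is the number of states in $E_\alpha$. In particular, if there is only one
ergodic class, this becomes

$$\lim_{n \to \infty} p_{ij}^{(n)} = \frac{1}{N}, \qquad i, j = 1, \cdots, N.$$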
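
The symmetric case with a single ergodic class can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the 3-state symmetric matrix `P` and the helper names `mat_mul` and `mat_power` are hypothetical choices made here. It computes a moderately high power of `P` in exact arithmetic and compares every entry with $1/N$.

```python
from fractions import Fraction

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(p, n):
    """Return P^n for an integer n >= 1."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Hypothetical symmetric stochastic matrix with a single ergodic class.
P = [[half, quarter, quarter],
     [quarter, half, quarter],
     [quarter, quarter, half]]

N = len(P)
Pn = mat_power(P, 20)
is_symmetric = all(Pn[i][j] == Pn[j][i] for i in range(N) for j in range(N))
max_deviation = max(abs(Pn[i][j] - Fraction(1, N))
                    for i in range(N) for j in range(N))
```

The power remains symmetric, and `max_deviation` is already far below any practical tolerance after 20 transitions.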

Case (j) In some applications the basic stochastic matrix satisfies the
condition

$$\sum_{i=1}^{N} p_{ij} = 1, \qquad j = 1, \cdots, N.$$

This condition, somewhat more general than that of symmetry, will be
satisfied no matter how the states are renumbered, and will also be



satisfied by the iterates $[p_{ij}^{(n)}]$, and by the limit $[q_{ij}]$. Consequently there
can be no transient states (which give rise to zero columns in $[q_{ij}]$).
Moreover, (2.11) becomes

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p_{ij}^{(m)} = \frac{1}{N_\alpha}, \qquad i, j \in E_\alpha,$$

where $N_\alpha$ is the number of states in $E_\alpha$. The Cesàro limit cannot in
general be replaced by an ordinary limit.

The possible sets of stationary absolute probabilities are fully described 
in the following theorem. 

THEOREM 2.4 For each $i$, $q_{i1}, \cdots, q_{iN}$ [defined by (2.11)] is a set of
stationary absolute probabilities. Conversely, any set of stationary absolute
probabilities is a linear combination (non-negative coefficients with sum 1)
of these $N$ sets, and even of the sets for $i \notin F$. More generally, every
solution of

$$\sum_{i=1}^{N} \xi_i p_{ij} = \xi_j, \qquad j = 1, \cdots, N, \tag{2.14}$$

is a linear combination of these sets, for $i \notin F$.

Note that according to this theorem each transient state has probability
0 in a stationary Markov chain with a finite number of states. If $i \in E_\alpha$,
$(q_{i1}, \cdots, q_{iN})$ is independent of $i$. According to the theorem each
ergodic class thus defines a single set of stationary absolute probabilities.
Since the one defined by $E_\alpha$ assigns probability 0 to all states outside $E_\alpha$,
the sets of stationary absolute probabilities defined in this way are linearly
independent. We have already remarked that, if $i \in F$ and $j \in E_\alpha$, then
$q_{ij}$ is proportional to $q_{kj}$ for $k \in E_\alpha$ and $j \in E_\alpha$, so that a row $(q_{i1}, \cdots, q_{iN})$
with $i \in F$ defines a set of stationary absolute probabilities linearly
dependent on those already described.

To prove the theorem, we remark that according to Theorem 2.1,
$QP = Q$, that is,

$$\sum_{j=1}^{N} q_{ij} p_{jk} = q_{ik}, \qquad i, k = 1, \cdots, N,$$

and according to this equation $q_{i1}, \cdots, q_{iN}$ is a set of stationary absolute
probabilities. Conversely, if $p_1, \cdots, p_N$ is a set of stationary absolute
probabilities, then for all $j$ and $n$

$$\sum_{i=1}^{N} p_i p_{ij}^{(n)} = p_j,$$

and therefore also

$$\sum_{i=1}^{N} p_i \, \frac{1}{n} \sum_{m=1}^{n} p_{ij}^{(m)} = p_j.$$

Hence ($n \to \infty$)

$$\sum_{i=1}^{N} p_i q_{ij} = p_j, \qquad j = 1, \cdots, N,$$

and these equations exhibit $p_1, \cdots, p_N$ as a linear combination, with
coefficients $p_1, \cdots, p_N$, of the rows of $Q$, as was to be proved. More
generally, this argument is applicable if $p_1, \cdots, p_N$ are replaced by
$\xi_1, \cdots, \xi_N$ satisfying (2.14).
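
Theorem 2.4 can be checked by hand on a small chain with one transient state and two ergodic classes. The following sketch is an illustration under stated assumptions: the matrices `P` and `Q` and the helper names are hypothetical, and `Q` is the limit matrix worked out for this particular `P` (the transient state 1 is absorbed in state 0 with probability 2/3 and in state 2 with probability 1/3).

```python
from fractions import Fraction as F

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

# Hypothetical chain: states 0 and 2 are absorbing, state 1 is transient.
P = [[F(1), F(0), F(0)],
     [F(1, 2), F(1, 4), F(1, 4)],
     [F(0), F(0), F(1)]]

# Limit matrix for this P.
Q = [[F(1), F(0), F(0)],
     [F(2, 3), F(0), F(1, 3)],
     [F(0), F(0), F(1)]]

power = P
for _ in range(39):
    power = mat_mul(power, P)          # power is now P^40
gap = max(abs(power[i][j] - Q[i][j]) for i in range(3) for j in range(3))

rows_stationary = all(vec_mat(row, P) == row for row in Q)
mixture = [F(2, 3) * Q[0][j] + F(1, 3) * Q[2][j] for j in range(3)]
general_solution = [F(3), F(0), F(-1)]  # solves (2.14) but is not a distribution
```

Each row of `Q` is stationary, the row belonging to the transient state equals `mixture`, a combination of the rows for the two ergodic classes, and `general_solution` shows a solution of (2.14) with a negative component that is still a linear combination of those two rows.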

As an application consider the following rather crude idealization of a
diffusion process. Two urns, $U_1$ and $U_2$, contain white and black balls.
A ball is taken at random from each urn and put in the other, and this
procedure is continued indefinitely. The problem is to find the asymptotic
distribution of balls in the two urns. Suppose that there are $N$ balls in
each urn and that there are $N$ white balls and $N$ black balls altogether.
Let $x_n$ be the number of white balls in $U_1$ after the $n$th interchange. The
$x_n$ process is evidently a Markov chain, with stochastic transition matrix
$[p_{ij}]$, $i, j = 0, \cdots, N$, given by

$$p_{ii} = \frac{2i(N-i)}{N^2}, \qquad 0 \le i \le N,$$
$$p_{i,i+1} = \left(\frac{N-i}{N}\right)^2, \qquad i < N,$$
$$p_{i,i-1} = \left(\frac{i}{N}\right)^2, \qquad 0 < i,$$
$$p_{ij} = 0 \quad \text{otherwise}.$$

In this case $p_{ij}^{(N)} > 0$ for all $i, j$ so that there are no transient states, only
one ergodic class, and this has no cyclically moving subclasses. Hence

$$\lim_{n \to \infty} p_{ij}^{(n)} = p_j > 0$$

exists for all $i, j$; $p_0, \cdots, p_N$ is the uniquely defined set of stationary absolute
probabilities. To evaluate the $p_j$'s one could solve the equations they
satisfy,

$$\sum_{i=0}^{N} p_i p_{ij} = p_j, \qquad \sum_{j=0}^{N} p_j = 1,$$

but it is simpler to guess at a solution and check the validity of the guess.
In fact, it is intuitively reasonable that an initial random distribution of
the balls in the urns, obtained by choosing $N$ balls at random out of the
$2N$ for $U_1$, would be preserved. This suggests that

$$p_j = \frac{(N!)^4}{(j!)^2 \, [(N-j)!]^2 \, (2N)!},$$

and in fact this evaluation satisfies the equations for the $p_j$'s. Therefore

$$\lim_{n \to \infty} p_{ij}^{(n)} = \frac{(N!)^4}{(j!)^2 \, [(N-j)!]^2 \, (2N)!},$$

and we have proved that no matter what the initial distribution of black
and white balls in the two urns, the distribution after a large number of
interchanges is asymptotically that obtained in filling the urns by choosing
the balls for each at random.
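
The guess made in the urn example is easy to confirm numerically. The sketch below is illustrative only: the function names `urn_matrix` and `guessed_distribution` and the choice $N = 5$ are assumptions made here. It builds the transition matrix from the formulas above, checks that the guessed probabilities, written as $\binom{N}{j}^2 / \binom{2N}{N}$, are stationary, and follows the chain from the state with no white balls in $U_1$.

```python
from fractions import Fraction
from math import comb

def urn_matrix(n):
    """Transition matrix for the number of white balls in U_1 (n balls per urn)."""
    p = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        if i < n:
            p[i][i + 1] = Fraction((n - i) ** 2, n * n)   # black out, white in
        if i > 0:
            p[i][i - 1] = Fraction(i * i, n * n)           # white out, black in
        p[i][i] = Fraction(2 * i * (n - i), n * n)         # like colors exchanged
    return p

def guessed_distribution(n):
    """C(n, j)^2 / C(2n, n), equal to (n!)^4 / ((j!)^2 ((n-j)!)^2 (2n)!)."""
    return [Fraction(comb(n, j) ** 2, comb(2 * n, n)) for j in range(n + 1)]

N = 5
P = urn_matrix(N)
pi = guessed_distribution(N)
pi_next = [sum(pi[i] * P[i][j] for i in range(N + 1)) for j in range(N + 1)]

# Start with no white balls in U_1 and apply 100 interchanges.
dist = [Fraction(1)] + [Fraction(0)] * N
for _ in range(100):
    dist = [sum(dist[i] * P[i][j] for i in range(N + 1)) for j in range(N + 1)]
gap = max(abs(dist[j] - pi[j]) for j in range(N + 1))
```

Exact rational arithmetic is used so that the stationarity check `pi_next == pi` is an identity rather than a floating-point approximation.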


3. Multiple Markov chains 


We have already remarked, in II §6, that a multiple Markov process,
that is, a process in which, roughly speaking, conditional probabilities of
future states depend on $\nu > 1$ past states, can be reduced to Markov
processes whose random variables are vectors. In the case of a finite
dimensional Markov chain this means that the multiple processes can be
reduced to finite dimensional Markov chains with a larger number of
possible states. If the given transition probabilities are stationary, the
only case we shall consider, we are given the transition probabilities

$$p_{i_1 \cdots i_\nu ; j} = P\{x_n(\omega) = j \mid x_{n-\nu}(\omega) = i_1, \cdots, x_{n-1}(\omega) = i_\nu\}$$

from which the other transition probabilities can be evaluated. If now
$\hat{x}_n$ is defined as the $\nu$-dimensional vector $(x_{n-\nu+1}, \cdots, x_n)$, the $\hat{x}_n$ process
is a Markov process with stationary transition probabilities $\hat{p}_{i_1 \cdots i_\nu ; j_1 \cdots j_\nu}$.
If each $x_n$ can assume $N$ values, $\hat{p}_{i_1 \cdots i_\nu ; j_1 \cdots j_\nu}$ defines a stochastic
matrix with $N^\nu$ rows and columns, corresponding to the $N^\nu$ values which
each $\hat{x}_n$ can assume. However, most of the elements of the stochastic
matrix vanish, since obviously

$$\hat{p}_{i_1 \cdots i_\nu ; j_1 \cdots j_\nu} = 0 \quad \text{unless } j_1 = i_2, \cdots, j_{\nu-1} = i_\nu;$$

thus all but $N^{\nu+1}$ of the transition probabilities necessarily vanish. If the
$N^\nu$ possible sets of values $i_1, \cdots, i_\nu$ are enumerated in some way, the $\hat{x}_n$
process can be considered an ordinary Markov chain with stationary
transition probabilities, whose random variables take on the values
$1, \cdots, N^\nu$; $\hat{p}_{i_1 \cdots i_\nu ; j_1 \cdots j_\nu}$ becomes $\hat{p}_{ij}$ with $i, j = 1, \cdots, N^\nu$. As already
remarked, most of the $\hat{p}_{ij}$'s vanish. The results of the previous section
can now be applied, and the conditions of the previous section become
conditions on the $\hat{x}_n$ process which can be translated into conditions on
the $x_n$ process. For example, by Theorem 2.1,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} \hat{p}^{(m)}_{i_1 \cdots i_\nu ; j_1 \cdots j_\nu}$$





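exists for all $i_1, \cdots, i_\nu, j_1, \cdots, j_\nu$. Summing over $j_1, \cdots, j_{\nu-1}$, we find
that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} P\{x_{m+\nu}(\omega) = j \mid x_1(\omega) = i_1, \cdots, x_\nu(\omega) = i_\nu\}$$

exists for all $i_1, \cdots, i_\nu, j$. In particular, suppose that

$$P\{x_{\nu+1}(\omega) = j \mid x_1(\omega) = i_1, \cdots, x_\nu(\omega) = i_\nu\} > 0$$

for all $i_1, \cdots, i_\nu, j$. Then it follows that

$$\hat{p}^{(\nu)}_{i_1 \cdots i_\nu ; j_1 \cdots j_\nu} > 0$$

for all $i_1, \cdots, i_\nu, j_1, \cdots, j_\nu$, which implies [Case (b) of the preceding
section] that

$$\lim_{n \to \infty} \hat{p}^{(n)}_{i_1 \cdots i_\nu ; j_1 \cdots j_\nu}$$

exists and is independent of $i_1, \cdots, i_\nu$, and accordingly that

$$\lim_{n \to \infty} P\{x_n(\omega) = j \mid x_1(\omega) = i_1, \cdots, x_\nu(\omega) = i_\nu\}$$

exists and is independent of $i_1, \cdots, i_\nu$.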
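
The reduction of a multiple Markov chain to an ordinary chain on vectors can be sketched in a few lines. The code below is an illustration under stated assumptions: the function name `lift_to_first_order`, the dictionary representation of the transition probabilities, and the numerical values for a chain of multiplicity 2 on two values are all hypothetical. It builds the $N^\nu \times N^\nu$ matrix, counts its nonvanishing elements, and checks that the $\nu$th power is strictly positive when all the given transition probabilities are positive.

```python
from fractions import Fraction
from itertools import product

def lift_to_first_order(p, n_values, order):
    """Build the stochastic matrix of the vector process.

    p maps each tuple of `order` past values to the distribution of the next value.
    """
    states = list(product(range(n_values), repeat=order))
    index = {s: k for k, s in enumerate(states)}
    size = len(states)
    matrix = [[Fraction(0)] * size for _ in range(size)]
    for s in states:
        for j in range(n_values):
            target = s[1:] + (j,)      # drop the oldest value, append the new one
            matrix[index[s]][index[target]] = p[s][j]
    return states, matrix

# Hypothetical chain of multiplicity 2 on the values 0 and 1.
p = {
    (0, 0): [Fraction(1, 2), Fraction(1, 2)],
    (0, 1): [Fraction(1, 3), Fraction(2, 3)],
    (1, 0): [Fraction(3, 4), Fraction(1, 4)],
    (1, 1): [Fraction(1, 5), Fraction(4, 5)],
}

states, P_hat = lift_to_first_order(p, n_values=2, order=2)
size = len(states)
nonzero = sum(1 for row in P_hat for x in row if x != 0)
P_hat_sq = [[sum(P_hat[i][k] * P_hat[k][j] for k in range(size))
             for j in range(size)] for i in range(size)]
square_positive = all(x > 0 for row in P_hat_sq for x in row)
```

With $N = 2$ and $\nu = 2$ the matrix has $N^\nu = 4$ rows and at most $N^{\nu+1} = 8$ nonzero entries, in agreement with the count given above.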


4. Application to card mixing 


The problem of card mixing is a good example of the application of
Markov chains. It is supposed that $M$ cards, labeled $1, \cdots, M$, are
shuffled repeatedly. Between shuffles their order in the deck is examined.
The problem is to determine reasonable hypotheses to be imposed on the
shuffling process that imply thorough mixing.

A shuffle is simply a permutation of the order in the deck; that is, a
shuffle corresponds to a permutation

$$\begin{pmatrix} 1 & \cdots & M \\ a_1 & \cdots & a_M \end{pmatrix}$$

of the integers $1, \cdots, M$; the $j$th card from the top of the deck becomes
the $a_j$th card from the top, $j = 1, \cdots, M$. A succession of shuffles thus
corresponds to a succession of the $M!$ possible permutations. The
thoroughness of the shuffling is dependent on how the permutations vary 
from shuffle to shuffle. For example, if the permutations are identical, 
each card will run cyclically through a succession of positions in the deck, 
and its position can be determined from a knowledge of the number of 
shuffles. The order in the deck, taking all the cards together, will also 
vary in this cyclic manner. This would not be consistent with any 
reasonable definition of thorough shuffling. On the other hand, the same 
cyclic character of changes of position would be true if the permutations 




ran successively over the M! possibilities, repeating the succession 
indefinitely. 

Neither of these examples exhibits the generally accepted idea of
thorough mixing, which includes unpredictability on the basis of the
initial positions and number of shuffles. It is necessary to introduce
probabilistic ideas. To do so let $X_n$, the permutation induced by the $n$th
shuffle, be a random variable. We number the $M!$ permutations, so that
$X_n$ can take on the values $1, \cdots, M!$. The character of the shuffling
then depends on the properties of the $X_n$ process, the shuffling process.

In examining the transitions of individual cards or groups of cards it
is appropriate to analyze certain processes derived from the shuffling
process rather than the shuffling process itself. Let $i^{(1)}, \cdots, i^{(\mu)}$ be any
$\mu$ ($\le M$) distinct natural numbers, and let

$$x_n^{(\mu)} = (i_n^{(1)}, \cdots, i_n^{(\mu)})$$

be the $\mu$-tuple of positions of the cards, after $n$ shuffles, that were initially
in positions $i^{(1)}, \cdots, i^{(\mu)}$ respectively ("position" meaning, say, the
number of a card counted from the top of the deck). The $x_n^{(\mu)}$ process
is determined completely by the initial positions $i^{(1)}, \cdots, i^{(\mu)}$ and by the
$X_n$ process. The $x_n^{(\mu)}$ process is the natural tool in the study of the
transitions of $\mu$-tuples of cards.

The shuffling process will be said to induce thorough mixing if in the
long run the $M!$ different orders of the cards become equally likely,
regardless of the initial shuffles, in the sense that for every positive integer $s$

$$\lim_{n \to \infty} P\{x_n^{(M)}(\omega) = \xi \mid x_1^{(M)}(\omega) = \xi_1, \cdots, x_s^{(M)}(\omega) = \xi_s\} = \frac{1}{M!} \tag{4.1}$$

for all sets of the $s + 1$ $M$-tuples $\xi, \xi_1, \cdots, \xi_s$ of distinct positive integers
between 1 and $M$. For some purposes a less thorough mixing is as useful
as this. For example, if a magician (now known as a telepath) tells a
victim to shuffle a deck and "take a card," the victim need not be sure
of thorough mixing to test the magician's powers. The relevant condition
here would be the far weaker one that, for every positive integer $s$ and
every integer $k$ between 1 and $M$,

$$\lim_{n \to \infty} P\{x_n^{(1)}(\omega) = k \mid x_1^{(1)}(\omega) = j_1, \cdots, x_s^{(1)}(\omega) = j_s\} = \frac{1}{M}$$

for all integers $j_1, \cdots, j_s$. If this is true, the $M$ different possible positions
of any card become equally likely in the long run, regardless of the first
$s$ positions. More generally, the shuffling process will be said to induce
thorough $\mu$-tuple mixing if, for every positive integer $s$,

$$\lim_{n \to \infty} P\{x_n^{(\mu)}(\omega) = \xi \mid x_1^{(\mu)}(\omega) = \xi_1, \cdots, x_s^{(\mu)}(\omega) = \xi_s\} = \frac{(M - \mu)!}{M!} \tag{4.2}$$





for all sets of the $s + 1$ $\mu$-tuples $\xi, \xi_1, \cdots, \xi_s$ of distinct integers between
1 and $M$. Thorough mixing is thus thorough $M$-tuple mixing. It is
easily verified that thorough $\mu$-tuple mixing implies thorough $\mu_0$-tuple
mixing if $\mu_0 < \mu$.

HYPOTHESIS (A) The variables $X_1, X_2, \cdots$ of the shuffling process are
mutually independent and have a common distribution.

This hypothesis, that the $X_n$'s correspond to repeated independent
trials, is probably the most natural one. Let

$$P_j = P\{X_n(\omega) = j\},$$

and let

$$p_{i^{(1)} \cdots i^{(\mu)} ; j^{(1)} \cdots j^{(\mu)}}$$

be the probability that cards in positions $i^{(1)}, \cdots, i^{(\mu)}$ go into positions
$j^{(1)}, \cdots, j^{(\mu)}$ by means of the $n$th shuffle. Then

$$p_{i^{(1)} \cdots i^{(\mu)} ; j^{(1)} \cdots j^{(\mu)}} = \sum_{k} P_k,$$

where the sum is over all those permutations taking the first $\mu$-tuple into
the second. The $x_n^{(\mu)}$ process is clearly a Markov chain, and we have
just evaluated the transition probability matrix. Moreover, when either
the $i$'s or the $j$'s run through all possible $\mu$-tuples the permutations
involved in the above sum run through all possible permutations, that is,

$$\sum_{i^{(1)}, \cdots, i^{(\mu)}} p_{i^{(1)} \cdots i^{(\mu)} ; j^{(1)} \cdots j^{(\mu)}} = \sum_{j^{(1)}, \cdots, j^{(\mu)}} p_{i^{(1)} \cdots i^{(\mu)} ; j^{(1)} \cdots j^{(\mu)}} = \sum_{k=1}^{M!} P_k = 1.$$

Thus the $x_n^{(\mu)}$ process is a Markov chain with stationary transition
probabilities, whose stochastic matrix has column sums (as well as row sums) 1.
The implications of this have been discussed in §2, Case (j).

Since the $x_n^{(\mu)}$ process is a Markov chain, condition (4.2) for thorough
mixing becomes

$$\lim_{n \to \infty} P\{x_n^{(\mu)}(\omega) = \xi \mid x_s^{(\mu)}(\omega) = \xi_s\} = \frac{(M - \mu)!}{M!}, \tag{4.2'}$$

and since there are stationary transition probabilities it is no restriction
to take $s = 1$.
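
The statement that the tuple chains have column sums as well as row sums equal to 1 can be verified directly for a small deck. The sketch below is illustrative only: the function names `tuple_chain` and `is_doubly_stochastic`, the convention that a shuffle is a tuple `perm` with `perm[pos]` the new position of the card at position `pos` (positions numbered from 0), and the shuffle probabilities for a 3-card deck are assumptions made here.

```python
from fractions import Fraction
from itertools import permutations

def tuple_chain(shuffle_probs, n_cards, mu):
    """Transition matrix of the chain of positions of mu tracked cards."""
    states = list(permutations(range(n_cards), mu))
    index = {s: k for k, s in enumerate(states)}
    size = len(states)
    matrix = [[Fraction(0)] * size for _ in range(size)]
    for perm, prob in shuffle_probs.items():
        for s in states:
            target = tuple(perm[pos] for pos in s)   # where the tracked cards go
            matrix[index[s]][index[target]] += prob
    return states, matrix

def is_doubly_stochastic(matrix):
    """True when every row sum and every column sum equals 1."""
    size = len(matrix)
    rows_ok = all(sum(row) == 1 for row in matrix)
    cols_ok = all(sum(matrix[i][j] for i in range(size)) == 1 for j in range(size))
    return rows_ok and cols_ok

# Hypothetical common distribution of the shuffles X_n under Hypothesis (A).
shuffle_probs = {
    (1, 0, 2): Fraction(1, 2),   # exchange the top two cards
    (1, 2, 0): Fraction(1, 3),   # move every card down one place cyclically
    (2, 1, 0): Fraction(1, 6),   # reverse the deck
}

checks = {mu: is_doubly_stochastic(tuple_chain(shuffle_probs, 3, mu)[1])
          for mu in (1, 2, 3)}
```

Every entry of `checks` is true because each permutation acts as a one-to-one map on the set of $\mu$-tuples, which is the argument given in the text.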

We continue by making more specific hypotheses on the shuffling 
process. These are certainly necessary to obtain interesting results, since, 
for example, Hypothesis (A) does not exclude the possibility that all the 
shuffles are identical. 




HYPOTHESIS ($A_\mu$) Hypothesis (A) is satisfied, and it is not possible to
divide the $\mu$-tuples of card positions into two groups in such a way that no
permutation with positive $P_j$ takes a $\mu$-tuple of one group into one of the
other.

Obviously ($A_\mu$) implies ($A_{\mu_0}$) if $\mu_0 < \mu$.

If ($A_\mu$) is satisfied the $x_n^{(\mu)}$ chain has only a single ergodic class, so that
[cf. §2, Case (e)]

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p^{(m)}_{i^{(1)} \cdots i^{(\mu)} ; j^{(1)} \cdots j^{(\mu)}} = \frac{(M - \mu)!}{M!},$$

that is to say, there is thorough $\mu$-tuple mixing in an average sense. It
would contradict hypothesis ($A_\mu$) if all the permutations with positive $P_j$
left the top $\mu$ cards of the deck on the top. On the other hand, suppose
that every permutation with positive $P_j$ permutes the top and bottom
halves of the deck (the example is modified slightly if $M$ is odd), permuting
among themselves the cards in each half, Then hypotheses (Aga tas 
(Ay) may be satisfied but none of the following hypotheses will be. 

HYPOTHESIS ($A'_\mu$) Hypothesis ($A_\mu$) is satisfied and for all large $s$ at
least one $\mu$-tuple of card positions has positive probability of returning to
itself after $s$ shuffles.

Hypothesis ($A'_\mu$) implies ($A'_{\mu_0}$) if $\mu_0 < \mu$.

If ($A'_\mu$) is satisfied, the $x_n^{(\mu)}$ process has only one ergodic class, and
there are no cyclically moving subclasses. Hence

$$\lim_{n \to \infty} p^{(n)}_{i^{(1)} \cdots i^{(\mu)} ; j^{(1)} \cdots j^{(\mu)}} = \frac{(M - \mu)!}{M!},$$

and this is precisely condition (4.2') for thorough $\mu$-tuple mixing.

Hypotheses ($A_\mu$) and ($A'_\mu$) are hypotheses on the zeros of the $P_j$'s. If
all the $P_j$'s are positive, that is, if all permutations are possible in the
shuffling process, Hypothesis ($A'_\mu$) will be satisfied for all $\mu$ and there
will be thorough mixing. In that case the stochastic matrices will have no
zero elements.

Hypotheses ($A_\mu$) and ($A'_\mu$) assume a particularly simple form if $\mu = M$:
($A_M$) is equivalent to the statement that (A) is satisfied and that it is
possible to go from any ordering of the deck to any other by way of
permutations with $P_j > 0$. (That is, the group of all permutations is
generated by those with positive probability.) Hypothesis ($A'_M$) adds
the condition that for all large $s$ it is possible to go from any ordering of
the deck to any other (or back to the same) by way of $s$ permutations of
positive probability. This additional condition will be satisfied, for
example, if the identical permutation (corresponding to not shuffling the
deck at all) has positive probability.
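
The effect of giving the identical permutation positive probability can be seen on a deck of three cards. The following sketch is an illustration under stated assumptions: the function names `shuffle_step` and `evolve`, the representation of an ordering as the tuple of current positions of cards 0, 1, 2, and both shuffle distributions are hypothetical. The first distribution uses the identity together with two permutations that generate all $3!$ orderings; the second repeats one cyclic permutation with probability 1.

```python
from fractions import Fraction

def shuffle_step(dist, shuffle_probs):
    """Apply one random shuffle to a distribution over orderings."""
    new = {}
    for order, weight in dist.items():
        for perm, prob in shuffle_probs.items():
            target = tuple(perm[pos] for pos in order)   # new position of each card
            new[target] = new.get(target, Fraction(0)) + weight * prob
    return new

def evolve(shuffle_probs, n_shuffles, n_cards=3):
    """Distribution over orderings after n_shuffles, starting from a sorted deck."""
    dist = {tuple(range(n_cards)): Fraction(1)}
    for _ in range(n_shuffles):
        dist = shuffle_step(dist, shuffle_probs)
    return dist

# Hypothetical shuffle law: identity, a transposition, and a 3-cycle.
mixing = {
    (0, 1, 2): Fraction(1, 2),
    (1, 0, 2): Fraction(1, 4),
    (1, 2, 0): Fraction(1, 4),
}
# Hypothetical shuffle law in which every shuffle is the same 3-cycle.
cyclic = {(1, 2, 0): Fraction(1)}

mixed = evolve(mixing, 60)
max_deviation = max(abs(prob - Fraction(1, 6)) for prob in mixed.values())
stuck = evolve(cyclic, 60)
```

After 60 shuffles the first law puts nearly equal weight on all six orderings, while the second leaves the deck in a single, predictable ordering.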




It is natural to attempt to extend Hypothesis (A). One way would be
to make hypotheses on the subsidiary $x_n^{(\mu)}$ processes to insure thorough
$\mu$-tuple mixing and to derive the corresponding hypotheses for the $X_n$
process. Suppose, for example, that the $x_n^{(\mu)}$ process is a Markov process
with stationary transition probabilities. This implies, if $\mu = M$, that the
probability $P_r$ of the $r$th permutation at the $n$th shuffle is unaffected by the
results of previous shuffles. In other words, for $\mu = M$ the hypothesis
is simply Hypothesis (A). If $\mu < M$ the hypothesis becomes weaker, but
its formulation in terms of the $X_n$ process is not particularly enlightening.

Another way to weaken Hypothesis (A) would be to suppose only that
the $X_n$ process is a Markov chain with stationary transition probabilities.
This hypothesis allows the shuffling process a certain memory, which is
reasonable. In general if the $X_n$ process is a multiple Markov chain of
multiplicity $\nu$, the $x_n^{(M)}$ process will be a multiple Markov chain of
multiplicity $\nu + 1$, and the general Markov chain theorems can be applied
to analyze the mixing (cf. §3). The subsidiary $x_n^{(\mu)}$ processes with
$\mu < M$ will not necessarily be multiple Markov chains, however.


5. Generalization of §2 to general state spaces 

The following is the natural generalization of a stochastic matrix $[p_{ij}]$.
Let $X$ be a space of points $\xi$, and let $\mathscr{F}_X$ be a Borel field of $X$ sets. A
function $p(\cdot, \cdot)$ of $\xi \in X$ and $A \in \mathscr{F}_X$ will be called a stochastic transition
function if it has the following properties:

(i) $p(\xi, A)$ for fixed $\xi$ determines a probability measure in $A$;

(ii) $p(\xi, A)$ for fixed $A$ determines a $\xi$ function measurable with respect
to the field $\mathscr{F}_X$.

If an initial probability distribution $p(\cdot)$ of sets $A \in \mathscr{F}_X$ is chosen, a
probability measure can be defined in the space $\Omega$ of sequences
$\omega$: $(\xi_1, \xi_2, \cdots)$, $\xi_j \in X$, as follows. Let $x_n$ be the $n$th coordinate function,
so that $x_n(\omega) = \xi_n$ if $\omega$ is the point $(\xi_1, \xi_2, \cdots)$. Define

$$P\{x_1(\omega) \in A_1\} = p(A_1), \qquad A_1 \in \mathscr{F}_X,$$

and in general define

$$P\{x_1(\omega) \in A_1, \cdots, x_n(\omega) \in A_n\} = \int_{A_1} p(d\xi_1) \int_{A_2} p(\xi_1, d\xi_2) \cdots \int_{A_n} p(\xi_{n-1}, d\xi_n), \qquad A_j \in \mathscr{F}_X. \tag{5.1}$$

This definition determines a probability measure of $\mathscr{F}$ sets, where
$\mathscr{F} = \mathscr{F}_X \times \mathscr{F}_X \times \cdots$ is the Borel field of $\omega$ sets generated by the class
of those of the form

$$\{x_j(\omega) \in A\}, \qquad j \ge 1, \quad A \in \mathscr{F}_X,$$

(see Supplement, §2).


In particular, if the initial probability distribution is concentrated at
the point $\xi$, that is, if

$$p(A) = \begin{cases} 1, & \xi \in A \\ 0, & \xi \in X - A \end{cases} \qquad A \in \mathscr{F}_X,$$

we shall denote the probability of an $\omega$ set $\Lambda$ by

$$P\{\Lambda \mid x_1(\omega) = \xi\}$$

and the expectation of a random variable $x$ by

$$E\{x \mid x_1(\omega) = \xi\}.$$

Here "random variable" means as usual a measurable $\omega$ function, and the
use of conditional probability and expectation notation will be justified
in the next paragraph. Replacing sequences $(\xi_1, \xi_2, \cdots)$ by sequences
$(\xi_n, \xi_{n+1}, \cdots)$, we find, in the obvious way, definitions of

$$P\{\Lambda \mid x_n(\omega) = \xi\}, \qquad E\{x \mid x_n(\omega) = \xi\}.$$

If $X$ is the set of real numbers, and if $\mathscr{F}_X$ is the field of linear Borel
sets, a stochastic process $\{x_n, n \ge 1\}$ is defined by the above procedure.
For any initial distribution the process is a Markov process, and the
quantities defined above become versions of the conditional probabilities
and expectations indicated by the notation. These versions will always
be the ones used in the following sections. The restriction of $X$ and $\mathscr{F}_X$
to the line and the class of Borel sets will not be made, however. In other
words, in this section we consider Markov processes whose random
variables are not necessarily numerically valued. The reader made
uneasy by this generality may prefer to restrict $X$ and $\mathscr{F}_X$, but will not
thereby simplify the discussion.

The $n$-step transition probabilities are easily calculated [cf. (1.6)],

$$p^{(1)}(\xi, A) = p(\xi, A),$$

$$p^{(n+1)}(\xi, A) = \int_X p(\eta, A) \, p^{(n)}(\xi, d\eta), \tag{5.2}$$

and are also stochastic transition functions. The probability of being in
$A$ at time $n$ is given recursively by [cf. (1.7)]

$$P\{x_n(\omega) \in A\} = \begin{cases} p(A), & n = 1 \\ \int_X p^{(n-1)}(\xi, A) \, p(d\xi), & n > 1. \end{cases}$$

If this probability is independent of $n$ the $x_n$ process is strictly stationary
and $p(\cdot)$ is called a stationary absolute probability distribution.




Example 1 Let $X$ space consist of exactly $N$ points labeled $1, \cdots, N$
and let $\mathscr{F}_X$ consist of $X$ and all its subsets. Let $[p_{ij}]$ be any $N$-dimensional
stochastic matrix, and define $p(\xi, A)$ by

$$p(\xi, A) = \sum_{\eta \in A} p_{\xi\eta}.$$

Then $p(\cdot, \cdot)$ is a stochastic transition function, and every stochastic
transition function with this $X$ and $\mathscr{F}_X$ can be generated in this way.
Moreover, the iterated stochastic transition functions determined by (5.2)
are generated in this way by the powers of $[p_{ij}]$. Thus the study of
$p^{(n)}(\xi, A)$ for large $n$ becomes that of $p_{ij}^{(n)}$ for large $n$; this study was
carried out in §2. The results obtained in that case go over almost
completely in the general case if rather weak hypotheses are imposed on
$p(\cdot, \cdot)$. Example 4 below shows, however, that these results do not
always hold.

For each $\xi$, if $A$ is small, $p(\xi, A)$ is small, roughly speaking, since
$p(\xi, A) \to 0$ when $A \to 0$. The hypotheses made on $p(\xi, A)$ usually
impose some kind of uniformity in $\xi$ on the smallness of $p(\xi, A)$ for small
$A$. In the following we shall use the hypothesis of Doeblin, somewhat
generalized over his original formulation.

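HYPOTHESIS (D) There is a (finite-valued) measure $\varphi$ of sets $A \in \mathscr{F}_X$
with $\varphi(X) > 0$, an integer $\nu \ge 1$, and a positive $\varepsilon$, such that

$$p^{(\nu)}(\xi, A) \le 1 - \varepsilon \quad \text{if } \varphi(A) \le \varepsilon.$$

We observe that, if (D) is satisfied, for some $\varphi$, $\nu$, and $\varepsilon$, it is satisfied for
every $n \ge \nu$ (in place of $\nu$) with the same $\varphi$ and $\varepsilon$, since, if $\varphi(A) \le \varepsilon$,

$$p^{(n+\nu)}(\xi, A) = \int_X p^{(\nu)}(\eta, A) \, p^{(n)}(\xi, d\eta) \le (1 - \varepsilon) \, p^{(n)}(\xi, X) = 1 - \varepsilon.$$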

Example 1 (continued) Hypothesis (D) is always satisfied in Example 1.
In fact, if $\varphi(A)$ is defined as the number of points (integers) in $A$, then, if
$\varphi(A) < 1$, $A$ is the null set so that $p(\xi, A) = 0$. Then (D) is satisfied if
$0 < \varepsilon < 1$. Thus Hypothesis (D) imposes no restriction on finite dimensional
stochastic matrices. Now suppose that the space $X$ has denumerably
infinitely many points denoted by $1, 2, \cdots$, so that the corresponding
stochastic matrix is infinite dimensional. Any $\varphi$ must be determined as
follows. Let $c_1, c_2, \cdots$ be any non-negative numbers with $\sum_j c_j < \infty$
and define $\varphi(A)$ by

$$\varphi(A) = \sum_{j \in A} c_j.$$

Then (D) is certainly satisfied if, for example, the series $\sum_{j=1}^{\infty} p_{ij}$ converges
uniformly in $i$; but uniformity of convergence is an unnecessarily strong
condition. If $[p_{ij}]$ is the identity matrix, condition (D) is not satisfied
for any choice of $\varphi$, that is, for any choice of the $c_j$'s.

Example 2 Let $X$ be a Borel set in $m$-dimensional Euclidean space, and
let $\mathscr{F}_X$ consist of the Borel subsets of $X$. Let $p_0(\cdot, \cdot)$ be a Baire function
of $\xi, \eta$ for which

$$p_0(\xi, \eta) \ge 0, \qquad \int_X p_0(\xi, \eta) \, d\eta = 1,$$

where $\xi, \eta$ are points in $m$ dimensions and the integration is with respect
to Lebesgue measure in $m$ dimensions. Then the function defined by

$$p(\xi, A) = \int_A p_0(\xi, \eta) \, d\eta$$

is a stochastic transition function. If we define $p_0^{(n)}(\xi, \eta)$ recursively by

$$p_0^{(n)}(\xi, \eta) = \begin{cases} p_0(\xi, \eta), & n = 1 \\ \int_X p_0^{(n-1)}(\xi, \zeta) \, p_0(\zeta, \eta) \, d\zeta, & n > 1, \end{cases}$$

then

$$p^{(n)}(\xi, A) = \int_A p_0^{(n)}(\xi, \eta) \, d\eta, \qquad n \ge 1.$$

In this important case $p(\cdot, \cdot)$ is the integral of a stochastic transition density
function $p_0(\cdot, \cdot)$, and the iterate $p^{(n)}(\cdot, \cdot)$ is the integral of the iterated density
function $p_0^{(n)}(\cdot, \cdot)$. This case can be generalized in the obvious way by
dropping the hypothesis that $X$ is a set in a Euclidean space and that the
basic measure of $X$ sets is Lebesgue measure. Without going to this
generalization we remark that, if $\varphi(A)$ is defined as the Lebesgue measure
of $A$, if $\varphi(X) < \infty$, and if $p_0(\cdot, \cdot)$ is bounded, say, $p_0(\xi, \eta) \le K$, then

$$p(\xi, A) \le K \varphi(A),$$

so that Hypothesis (D) is satisfied if $\varepsilon < 1/(K + 1)$. More generally, if
$\varphi(X) < \infty$ and if $p_0(\xi, \eta)$ is supposed uniformly (in $\xi$) integrable in $\eta$,
Hypothesis (D) is still satisfied. However, even this condition, for the
given $\varphi$ measure, is far stronger than (D) since it implies that, if $\varphi(A)$ is
near 0, $p(\xi, A)$ is near 0 uniformly in $\xi$, whereas (D) only requires that
under those hypotheses $p(\xi, A)$ be $< 1$ uniformly in $\xi$.

Example 3 If $p(\cdot, \cdot)$ is a stochastic transition function and if $\psi$ is some
finite-valued measure of $\mathscr{F}_X$ sets with the property that for some $\nu$

$$p^{(\nu)}(\xi, A) \le \psi(A),$$

then (D) is satisfied if $0 < \varepsilon < \tfrac{1}{2}$ and if $\varphi = \psi$. The following stronger
condition is sometimes useful:

$$p^{(\nu)}(\xi, A) \le K p^{(\nu)}(\eta, A)$$

for all $\xi, \eta, A$, for some integer $\nu$ and constant $K$. This condition is far
stronger than the first, since it implies that if, for some $\xi$ and some
sequence $A_1, A_2, \cdots$,

$$\lim_{n \to \infty} p^{(\nu)}(\xi, A_n) = 0,$$

then this limit equation is true for all $\xi$ and is uniform in $\xi$.

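In the following we denote by $\tilde{p}^{(n)}(\xi, A)$ the conditional probability that
(from initial point $\xi$) the system will be in a state of $A$ at some time during
the first $n$ transitions, that is,

$$\tilde{p}^{(n)}(\xi, A) = P\Big\{ \bigcup_{j=2}^{n+1} [x_j(\omega) \in A] \;\Big|\; x_1(\omega) = \xi \Big\}.$$

The quantity $\tilde{p}^{(n)}(\xi, A)$ is non-decreasing in $n$.

LEMMA 5.1 Under Hypothesis (D), if a set $A \in \mathscr{F}_X$ has the property that

$$\lim_{n \to \infty} \tilde{p}^{(n)}(\xi, A) = \operatorname*{L.U.B.}_{n} \tilde{p}^{(n)}(\xi, A) > 0$$

for all $\xi$, then there is a positive integer $\mu$ and a positive $\rho < 1$ for which

$$\tilde{p}^{(n\mu)}(\xi, A) \ge 1 - \rho^n, \qquad \xi \in X.$$

Since $\tilde{p}^{(n)}(\xi, A)$ is monotone non-decreasing in $n$, and positive for
sufficiently large $n$ (depending on $\xi$), there is a $\delta > 0$, a positive integer $\kappa$,
and a set $B \in \mathscr{F}_X$ for which

$$\varphi(B) \le \varepsilon,$$

$$\tilde{p}^{(\kappa)}(\xi, A) \ge \delta, \qquad \xi \in X - B.$$

Then, according to Hypothesis (D),

$$p^{(\nu)}(\xi, B) \le 1 - \varepsilon, \qquad \xi \in X. \tag{5.3}$$

To prove the lemma we take $\mu = \kappa + \nu$ and prove

$$1 - \tilde{p}^{(m\mu)}(\xi, A) \le (1 - \delta\varepsilon)^m, \tag{5.4}$$

which implies the desired statement, with $\rho = 1 - \delta\varepsilon$. When $m = 1$,
(5.4) becomes

$$\tilde{p}^{(\mu)}(\xi, A) \ge \delta\varepsilon, \qquad \xi \in X,$$

which follows [using (5.3)] from

$$\tilde{p}^{(\mu)}(\xi, A) \ge \int_{X - B} \tilde{p}^{(\kappa)}(\eta, A) \, p^{(\nu)}(\xi, d\eta) \ge \delta \, p^{(\nu)}(\xi, X - B) \ge \delta\varepsilon.$$

To prove (5.4) in general we suppose it is true for $m$ and prove it for
$m + 1$. Let $\pi^{(n)}(\xi, E)$ be the conditional probability, starting at $\xi$, of
going into $E$ after $n\mu$ transitions, always remaining out of $A$. Then

$$1 - \tilde{p}^{(n\mu)}(\xi, A) = \pi^{(n)}(\xi, X - A)$$

and assuming (5.4) for some $m$ (and also for $m = 1$, in which case it has
already been proved)

$$\pi^{(m+1)}(\xi, X - A) = \int_{X - A} \pi^{(m)}(\eta, X - A) \, \pi^{(1)}(\xi, d\eta) \le (1 - \delta\varepsilon)^m \, \pi^{(1)}(\xi, X - A) \le (1 - \delta\varepsilon)^{m+1},$$

as was to be proved.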

The study of the asymptotic properties of $p^{(n)}(\xi, A)$ as $n \to \infty$ will go
step by step, taking in succession cases which are generalized versions of
those in §2. If $p(\xi, A)$ is the integral of a density function which is
reasonably positive, the proofs go over with no change. The difficulty is
to show how Hypothesis (D) implies the existence and good behavior of
a density function. To handle these density functions we introduce a
"two-dimensional" measure of $(\xi, \eta)$ sets, $\xi \in X$, $\eta \in X$ in the usual way.
If $E$ and $E'$ are sets in the class $\mathscr{F}_X$, the $\xi, \eta$ "rectangle" $\hat{E}$ defined by the
conditions $\xi \in E$, $\eta \in E'$ is assigned the measure

$$\hat{\varphi}(\hat{E}) = \varphi(E) \varphi(E'),$$

where $\varphi$ is the measure in Hypothesis (D). This measure is extended in
the usual way to the sets of the Borel field $\mathscr{F}_X \times \mathscr{F}_X$ of the
$(\xi, \eta)$ sets generated by the class of rectangles.

For each & and each n, p'")(&, *) is a measure of F x sets, and therefore 
has an absolutely continuous and a singular component with respect to 
the measure g (see Supplement, §2). That is, we can write 


(5.5) ping, E) = | po Es mod) + AE, E), 
E 


where po("(é,*) is an y function measurable with respect to F x, and 
A((é, +) is a measure of F x sets with maximum value on a set (depending 
on n and &), of measure 0. We shall denote by Condition (2) the condition 
that this representation of p™ is possible for all n with pọ™ measurable in 
the pair £, n, that is, measurable with respect to F x, and when Condition 
(3) is satisfied we shall always suppose that the pọ™ used has this property. 
Attention will be called to every use of this condition, which will be used 




only to obtain certain preliminary results. The final results will not 
require the validity of this condition. Note that, if Condition (X) is 
satisfied, A((-, A) is a € function measurable with respect to. F x. More- 
over, under this condition, 


prime, A) = fem A)pn(, db) 


x 


> f (an) | oE, DoE Da(dd), 
A x 


so that, if A is a subset of the complement of the singular set of the 
measure A(™+"(E, +), 


frome, Dalan) = | olan) | PoE, ME, Dodd). 
A A x 


Then this inequality holds for all A, since the singular set in question has 
y measure 0, and it follows that 


pol (En) = | PE DPE, Orde) 
x 


for almost all 7 (p measure). We shall assume from now on, if Condition 
(È) is satisfied, that the po'*”’s have been chosen in such a way that this 
inequality is true for ally. To justify this assumption, we must show that 
the po\*”’s can be chosen to satisfy this condition. Such a choice can be 
made as follows. Assume first any choice of the pp*”’s. We then choose 
po”s, new versions of the pos, which satisfy the stated condition. 
Choose fg") = po). If fo” has been chosen already, for j < k (k> 1) 
we have seen that for each € 


PoE) = Max f pC, MAE, DdD, 
Sm<k xy 


for almost all 7. Define fo(é, 7) as the maximum of the left and right 
sides of the inequality, for all &, 7. It is immediately verified that this 
choice satisfies the stated conditions. 

According to Example 2.7 of the Supplement, Condition (2) is satisfied 
if the following condition is satisfied. 

CoNDITION (X) There is a sequence of F x sets, generating the Borel 
field G, such that to every F x set corresponds a G set differing from it 
by a set of p measure 0. 

Condition (%’) is satisfied in most cases of interest. For example, if 
X is a Borel set in k-dimensional Cartesian space, if F x is the class of 




k-dimensional Borel subsets of X, and if g is a measure of Borel sets, the 
sequence of F y sets of Condition (X^) can be taken as the intersections 
with Y of the k-dimensional open intervals with rational vertices. 

We now discuss various cases, always under (D) and sometimes under 
the additional condition (=). 

Case (a) $p(\xi, E) = p(E)$ is independent of $\xi$. In this case,

$$p^{(n)}(\xi, E) = p(E)$$

for $n \ge 1$, the random variables of the Markov process are mutually
independent, and the discussion of §2, Case (a), needs no modification.
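Case (b) Suppose that (D) is strengthened to (D'): there is a measure
$\varphi$ of $\mathscr{F}_X$ sets, with $0 < \varphi(X) < \infty$, an integer $\nu \ge 1$, a $\delta > 0$, and an $\mathscr{F}_X$
set $C$ for which

$$\varphi(C) > 0,$$

$$p_0^{(\nu)}(\xi, \eta) \ge \delta, \qquad \xi \in X, \quad \eta \in C.$$

Here $p_0^{(\nu)}(\xi, \cdot)$ is, as above, the density of the absolutely continuous component
of $p^{(\nu)}(\xi, \cdot)$ with respect to $\varphi$, but we do not suppose that the
measurability condition on $p_0^{(n)}$ introduced above is satisfied. Then there is a
stationary absolute probability distribution $p(\cdot)$, with $p(C_1) \ge \delta\varphi(C_1)$ when
$C_1 \subset C$, for which

$$|p^{(n)}(\xi, A) - p(A)| \le [1 - \delta\varphi(C)]^{(n/\nu) - 1}, \qquad n = 1, 2, \cdots. \tag{5.6}$$

We observe that (D') implies (D), since if (D') is true and if
$\varphi(A) \le \varphi(C)/2$,

$$p^{(\nu)}(\xi, A) \le 1 - \int_{(X - A)C} p_0^{(\nu)}(\xi, \eta) \, \varphi(d\eta) \le 1 - \frac{\delta\varphi(C)}{2},$$

so that (D) is true with the same $\nu$, $\varphi$, and with $\varepsilon$ the smaller of the numbers
$\varphi(C)/2$, $\delta\varphi(C)/2$.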

Case (b) is the obvious analogue of §2, Case (b), and the proof of 
(5.6) will only be sketched since it follows that of §2 (2.7). Define 
m ma Mg” by 


mg = GLB. pE, E), Mr” = LUB. pE, E). 


Then just as in §2, cf. (2.4), 
mg) ama a +» My?) < Mp. 


For fixed £ and 7 the set function defined by 
(E) = PE, E)— pn, E) 


198 MARKOV PROCESSES—DISCRETE PARAMETER v 


is completely additive. There is therefore a certain set S+ (that on which 
(E) is a maximum) on every F x subset of which y(Z) > 0 and on every 
F x subset of whose complement S~, y(E) < 0. Then 


y(St) + pS) = wX) = 0, 
y(S*) = pE, S*)— p”, S*) 


<1- f PE DdD — | pon, Depa’) 
Š- 5+ 


< 1— 69(C). 

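Using these two facts, we find that

$$M_E^{(n+\nu)} - m_E^{(n+\nu)} = \operatorname*{L.U.B.}_{\xi,\eta} \int_X p^{(n)}(\zeta, E)\,[p^{(\nu)}(\xi, d\zeta) - p^{(\nu)}(\eta, d\zeta)]$$

$$\le \operatorname*{L.U.B.}_{\xi,\eta} \Big[ \int_{S^+} M_E^{(n)}\,\psi(d\zeta) + \int_{S^-} m_E^{(n)}\,\psi(d\zeta) \Big]$$

$$= \operatorname*{L.U.B.}_{\xi,\eta}\, \psi(S^+)\,[M_E^{(n)} - m_E^{(n)}]$$

$$\le [1 - \delta\varphi(C)]\,(M_E^{(n)} - m_E^{(n)}).$$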
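Then as in §2 we find that

$$M_E^{(k\nu)} - m_E^{(k\nu)} \le [1 - \delta\varphi(C)]^{k}, \qquad k \ge 1,$$

that $M_E^{(n)}$, $m_E^{(n)}$ must have a common limit $p(E)$ when $n \to \infty$, and that

$$|p^{(n)}(\xi, E) - p(E)| \le M_E^{(n)} - m_E^{(n)} \le [1 - \delta\varphi(C)]^{(n/\nu)-1};$$

if $C_1 \subset C$,

$$p(C_1) \ge m_{C_1}^{(\nu)} \ge \delta\varphi(C_1).$$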


The set function $p(\cdot)$ is non-negative, with $p(X) = 1$. As the uniform
limit of completely additive set functions it is itself completely additive.
That is, $p(\cdot)$ is a probability measure. From (5.2),

$$p^{(m+n)}(\xi, E) = \int_X p^{(n)}(\eta, E)\, p^{(m)}(\xi, d\eta),$$

and this becomes, when $m \to \infty$,

$$p(E) = \int_X p^{(n)}(\eta, E)\, p(d\eta).$$

This equation identifies $p(\cdot)$ as a stationary absolute probability distribution,
and ends the discussion of Case (b).
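The contraction argument of Case (b) can be checked numerically on a finite state space. The following is only a sketch under stated assumptions: the 3-state matrix `P` is a hypothetical example, the measure $\varphi$ is taken to be counting measure, $\nu = 1$, and $C$ is the whole space, so that the bound reads $M_E^{(n)} - m_E^{(n)} \le [1 - \delta\varphi(C)]^n$. The helper names are illustrative and do not come from the text.

```python
from itertools import combinations

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def oscillation(pn, subset):
    """M_E - m_E for E = subset: spread of p^(n)(xi, E) over starting states xi."""
    values = [sum(row[j] for j in subset) for row in pn]
    return max(values) - min(values)

def max_oscillation(pn):
    """Largest M_E - m_E over all non-empty sets E of states."""
    n = len(pn)
    subsets = (s for r in range(1, n + 1) for s in combinations(range(n), r))
    return max(oscillation(pn, s) for s in subsets)

# Hypothetical transition matrix (assumption); phi = counting measure, nu = 1.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
C = [0, 1, 2]                                      # set on which the density is >= delta
delta = min(P[i][j] for i in range(3) for j in C)  # lower bound of the density on C
rate = 1.0 - delta * len(C)                        # 1 - delta * phi(C)

def contraction_table(steps=10):
    """Rows (n, max_E (M_E^(n) - m_E^(n)), rate**n) for n = 1..steps."""
    table, pn = [], P
    for n in range(1, steps + 1):
        table.append((n, max_oscillation(pn), rate ** n))
        pn = mat_mul(pn, P)
    return table
```

Calling `contraction_table()` lists, for each $n$, the observed oscillation next to the geometric bound; in this example the observed values fall well below the bound, which is not sharp.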




We shall see that under Hypothesis (D) there is a $q(\xi, E)$ for which

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p^{(m)}(\xi, E) = q(\xi, E)$$

exists uniformly in $\xi$ and $E$. We omit the proof corresponding to that
of Theorem 2.1 because it involves ideas of compactness in Banach spaces,
and abstract ergodic theorems which would take us too far afield. The
result will be obtained by a detailed discussion of the actual transitions,
following Doeblin.
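A minimal illustration of why an averaged (Cesàro) limit is the natural object here: for a chain that moves cyclically, the powers of the transition matrix oscillate while their averages converge. The two-state matrix `FLIP` below is a hypothetical example, and the function names are assumptions made for this sketch.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cesaro_average(P, n):
    """Return (1/n) * (P^1 + P^2 + ... + P^n)."""
    size = len(P)
    power = [row[:] for row in P]                  # current power, starting at P^1
    total = [[0.0] * size for _ in range(size)]
    for _ in range(n):
        for i in range(size):
            for j in range(size):
                total[i][j] += power[i][j]
        power = mat_mul(power, P)
    return [[value / n for value in row] for row in total]

# Deterministic flip between two states: P^n alternates between FLIP and the identity.
FLIP = [[0.0, 1.0],
        [1.0, 0.0]]
```

Here `cesaro_average(FLIP, n)` has all entries equal to 1/2 for every even `n`, although `FLIP` raised to the power `n` has no limit.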

Case (c) Suppose that the $\varphi$ of (D) has the property that, whenever
$\varphi(E) > 0$,

$$\operatorname*{L.U.B.}_{n}\, p^{(n)}(\xi, E) > 0$$

for all $\xi$. This condition is obviously equivalent to the condition that

$$\operatorname*{L.U.B.}_{n}\, p^{(n\nu)}(\xi, E) > 0$$

for all $\xi$, which implies, by Lemma 5.1, that $\lim_{n \to \infty} p_n(\xi, E) = 1$ uniformly
in $\xi$. In other words, in this natural generalization of Case (c) of §2 we
suppose that every $\xi$ has positive probability of entering (at some time)
every set $E$ of positive $\varphi$ measure. The next two lemmas will make it
possible to use the arguments of §2 to treat this case.


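LEMMA 5.2 Let $\hat A$ be a $(\xi, \eta)$ set in $\mathcal{F}_X \times \mathcal{F}_X$. There is then a sequence of
decompositions of $X$,

$$X = \bigcup_{j} H_j^{(n)}, \qquad H_j^{(n)} \in \mathcal{F}_X, \qquad H_j^{(n)} H_k^{(n)} = 0 \ (j \ne k),$$

with the property that, if for each $n$ the subscripts $i, j$ are chosen, as functions
of $\xi, \eta$, so that $\xi \in H_i^{(n)}$, $\eta \in H_j^{(n)}$, and if $\hat H_{ij}^{(n)}$ is the $(\xi, \eta)$ set
$\{\xi \in H_i^{(n)}, \eta \in H_j^{(n)}\}$, then

$$\lim_{n \to \infty} \frac{\hat\varphi(\hat A \hat H_{ij}^{(n)})}{\hat\varphi(\hat H_{ij}^{(n)})} = 1 \tag{5.7}$$

for almost all $(\xi, \eta)$ in $\hat A$ ($\hat\varphi$ measure).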
This lemma states, in more geometric language, that the $H_j^{(n)}$'s can be
chosen in such a way that almost all points of $\hat A$ have density 1 in $\hat A$
relative to the net of the $\hat H_{ij}^{(n)}$'s. If we define the function $\hat\varphi_{\hat A}$ of
$\mathcal{F}_X \times \mathcal{F}_X$ sets by

$$\hat\varphi_{\hat A}(\hat\Lambda) = \hat\varphi(\hat A \hat\Lambda),$$

then

$$\hat\varphi_{\hat A}(\hat\Lambda) = \int_{\hat\Lambda} \psi \, d\hat\varphi,$$

where $\psi$ is the characteristic function of $\hat A$, that is,

$$\psi(\xi, \eta) = 1, \quad (\xi, \eta) \in \hat A; \qquad \psi(\xi, \eta) = 0 \quad \text{otherwise}.$$

Thus $\psi$ is the density of the absolutely continuous set function $\hat\varphi_{\hat A}$ relative
to $\hat\varphi$. The ratio in (5.7) is the nth generalized difference quotient at
$(\xi, \eta)$ of $\hat\varphi_{\hat A}$, on the net of $\hat H_{ij}^{(n)}$'s, relative to $\hat\varphi$. If every $\hat H_{ij}^{(n+1)}$ is a
subset of some $\hat H_{kl}^{(n)}$, this nth generalized difference quotient converges
for almost all $(\xi, \eta)$ when $n \to \infty$ to the derivative of $\hat\varphi_{\hat A}$ with respect to $\hat\varphi$
relative to the net, according to Theorem 2.4 of the Supplement. According
to this same theorem, this derivative is equal to the density $\psi$
almost everywhere if $\psi$ is measurable with respect to the Borel field
generated by the $\hat H_{ij}^{(n)}$'s, that is, if $\hat A$ is in this Borel field. Now (Supplement,
Theorem 2.5) there is a sequence $\{B_n\}$ of $\mathcal{F}_X$ sets such that, if $\mathcal{F}_X'$
is the Borel field generated by $\{B_n\}$, then $\hat A \in \mathcal{F}_X' \times \mathcal{F}_X'$. For each $n$
define the sets $H_1^{(n)}, H_2^{(n)}, \cdots$ as the intersections of the form $\bigcap_{j=1}^{n} A_j$, where $A_j$
is either $B_j$ or $X - B_j$. These $H_j^{(n)}$'s are disjunct with union $X$, for
each $n$, each $H_j^{(n+1)}$ is a subset of some $H_k^{(n)}$, and the class of $H_j^{(n)}$'s,
$j, n \ge 1$, generates the Borel field $\mathcal{F}_X'$. Then the $H_j^{(n)}$'s satisfy the
requirements of the lemma.

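LEMMA 5.3 Under Conditions (D) and (©) there are sets $A \in \mathcal{F}_X$,
$B \in \mathcal{F}_X$, for which $\varphi(A) > 0$, $\varphi(B) > 0$,

$$\operatorname*{G.L.B.}_{\xi \in A,\ \eta \in B}\, p_0^{(2\nu)}(\xi, \eta) > 0.$$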


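To prove this let $\hat A$ be the $(\xi, \eta)$ set where

$$\frac{\varepsilon}{2\varphi(X)} \le p_0^{(\nu)}(\xi, \eta) \le \frac{1}{\varepsilon}.$$

Here $\varepsilon$ is as described in (D). Let $A_\xi$ be the $\eta$ set for which $(\xi, \eta) \in \hat A$
and let $B_\eta$ be the $\xi$ set for which $(\xi, \eta) \in \hat A$. We prove first that

$$\varphi(A_\xi) \ge \frac{\varepsilon^2}{2}. \tag{5.8}$$

In fact,

$$\frac{\varphi(A_\xi)}{\varepsilon} \ge \int_{A_\xi} p_0^{(\nu)}(\xi, \eta)\,\varphi(d\eta)
  = 1 - \int_{\{p_0^{(\nu)}(\xi, \eta) < \varepsilon/[2\varphi(X)]\}} p_0^{(\nu)}(\xi, \eta)\,\varphi(d\eta)
  - \int_{\{p_0^{(\nu)}(\xi, \eta) > 1/\varepsilon\}} p_0^{(\nu)}(\xi, \eta)\,\varphi(d\eta) - \Delta(\xi, X).$$

Now the first integral on the right side of this inequality is at most $\varepsilon/2$.
The domain of integration of the second has $\varphi$ measure at most $\varepsilon$ (since
the value of the integral is at most 1). Hence the second integral and
$\Delta(\xi, X)$, which is the contribution to $p^{(\nu)}(\xi, X)$ from a set of $\varphi$ measure
0, give a total contribution of at most $1 - \varepsilon$, by condition (D). Thus

$$\frac{\varphi(A_\xi)}{\varepsilon} \ge \int_{A_\xi} p_0^{(\nu)}(\xi, \eta)\,\varphi(d\eta) \ge 1 - \frac{\varepsilon}{2} - (1 - \varepsilon) = \frac{\varepsilon}{2},$$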

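and we have proved (5.8). We now apply Lemma 5.2 to $\hat A$, and show
that there are two points $(\xi_0, \eta_0)$ and $(\xi_1, \eta_1)$ in $\hat A$ where there is a limit
1 in (5.7) and for which $\eta_0 = \xi_1$. According to Lemma 5.2, applying
Fubini's theorem, there is a $\xi$ set $X'$ with $\varphi(X - X') = 0$ and for each
$\xi \in X'$ an $\eta$ measurable set $A_\xi' \subset A_\xi$ with $\varphi(A_\xi - A_\xi') = 0$ such that the
limit in (5.7) is 1 if $\xi \in X'$, $\eta \in A_\xi'$. Choose $\xi_0, \eta_0, \xi_1, \eta_1$ to satisfy the
following conditions:

$$\xi_0 \in X', \qquad \eta_0 = \xi_1 \in A_{\xi_0}' X', \qquad \eta_1 \in A_{\xi_1}'.$$

This is possible since $\varphi(X') = \varphi(X) > 0$ and, by (5.8), $\varphi(A_{\xi_0}' X') = \varphi(A_{\xi_0})
\ge \varepsilon^2/2$, and $\varphi(A_{\xi_1}') = \varphi(A_{\xi_1}) \ge \varepsilon^2/2$. The two points $(\xi_0, \eta_0)$, $(\xi_1, \eta_1)$
have the desired properties. Lemma 5.3 will now be proved. For each
$n$ choose $i, j, k$ so that

$$\xi_0 \in H_i^{(n)}, \qquad \eta_0 = \xi_1 \in H_j^{(n)}, \qquad \eta_1 \in H_k^{(n)}$$

(using the notation of Lemma 5.2). Denote by $\hat H_{ij}^{(n)}$ the $(\xi, \eta)$ set of points
$(\xi, \eta)$ where $\xi \in H_i^{(n)}$, $\eta \in H_j^{(n)}$. By Lemma 5.2 $n$ can be chosen so large
that

$$\int_{H_i^{(n)}} \varphi(A_\xi H_j^{(n)})\,\varphi(d\xi) = \hat\varphi(\hat A \hat H_{ij}^{(n)}) > \tfrac{3}{4}\hat\varphi(\hat H_{ij}^{(n)}) = \tfrac{3}{4}\varphi(H_i^{(n)})\varphi(H_j^{(n)}),$$

$$\int_{H_k^{(n)}} \varphi(B_\eta H_j^{(n)})\,\varphi(d\eta) = \hat\varphi(\hat A \hat H_{jk}^{(n)}) > \tfrac{3}{4}\hat\varphi(\hat H_{jk}^{(n)}) = \tfrac{3}{4}\varphi(H_j^{(n)})\varphi(H_k^{(n)}).$$

In the following $n$ will be chosen in this way and held fast. There are
then sets $A$ and $B$ in $\mathcal{F}_X$, for which

$$A \subset H_i^{(n)}, \qquad \varphi(A) > 0, \qquad \varphi(A_\xi H_j^{(n)}) > \tfrac{3}{4}\varphi(H_j^{(n)}) \quad (\xi \in A),$$

$$B \subset H_k^{(n)}, \qquad \varphi(B) > 0, \qquad \varphi(B_\eta H_j^{(n)}) > \tfrac{3}{4}\varphi(H_j^{(n)}) \quad (\eta \in B).$$

It follows that $\varphi(A_\xi B_\eta H_j^{(n)}) > \tfrac{1}{2}\varphi(H_j^{(n)})$ so that, if $\xi \in A$ and $\eta \in B$,

$$p_0^{(2\nu)}(\xi, \eta) \ge \int_{A_\xi B_\eta H_j^{(n)}} p_0^{(\nu)}(\xi, \zeta)\, p_0^{(\nu)}(\zeta, \eta)\,\varphi(d\zeta) \ge \Big[\frac{\varepsilon}{2\varphi(X)}\Big]^2 \frac{\varphi(H_j^{(n)})}{2} > 0,$$

which proves the lemma.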
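LEMMA 5.4 In Case (c), under Conditions (D) and (©), there is a set $C$,
with $\varphi(C) > 0$, for which

$$\operatorname*{G.L.B.}_{\xi, \eta \in C}\, p_0^{(\alpha)}(\xi, \eta) > 0$$

for some $\alpha$.

In fact, let $A$ and $B$ be as in Lemma 5.3. In Case (c) there must be a
set $C \in \mathcal{F}_X$, a positive $\delta$, and a positive integer $\beta$ for which

$$C \subset B, \qquad \varphi(C) > 0, \qquad p^{(\beta)}(\xi, A) \ge \delta \quad (\xi \in C).$$

Then, taking into account transitions from $C$ to $A$ to $C$, we find that

$$p_0^{(\beta + 2\nu)}(\xi, \eta) \ge \delta \operatorname*{G.L.B.}_{\zeta \in A,\ \eta \in B}\, p_0^{(2\nu)}(\zeta, \eta).$$

This proves the lemma, with $\alpha = \beta + 2\nu$.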
We proceed to the study of the transitions in Case (c). For each $C$
with the properties described in Lemma 5.4 let $I(C)$ be the class of integers
$n$ for which

$$\operatorname*{G.L.B.}_{\xi, \eta \in C}\, p_0^{(n)}(\xi, \eta) > 0.$$

The class $I(C)$ contains $n_1 + n_2$ if it contains $n_1$ and $n_2$, and therefore,
according to Lemma 2.2, contains all large multiples of $d(C)$, the highest
common factor of its members. Moreover, we now prove that $d(C)$ is
independent of $C$. In fact, if $n_1 \in I(C_1)$, if $n_2 \in I(C_2)$, if $m_{12}$ is chosen so
that $p^{(m_{12})}(\xi, C_2) > 0$ on a subset of $C_1$ of positive $\varphi$ measure [hypothesis
of Case (c)], and if $m_{21}$ is chosen so that $p^{(m_{21})}(\xi, C_1) > 0$ on a subset of
$C_2$ of positive $\varphi$ measure, the succession of transitions $C_1$ to $C_1$ to $C_2$ to
$C_2$ to $C_1$ to $C_1$ corresponds to the fact that

$$n_1 + m_{12} + n_2 + m_{21} + n_1 \in I(C_1).$$

Since this is true for all $n_2 \in I(C_2)$, $d(C_1)$ is a factor of every difference
between two values of $n_2$; in particular, $d(C_1)$ is a factor of $d(C_2)$. Hence
$d(C_1) = d(C_2)$ because $C_1$ and $C_2$ can be interchanged in this argument.
In the following we shall write $d$ for $d(C)$.
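The arithmetic fact used above (a class of positive integers closed under addition contains all large multiples of the highest common factor of its members) can be explored with a short sketch. The generator lists are hypothetical examples, the function names are illustrative, and the search is truncated at `limit`, which is assumed to be large enough for the generators supplied.

```python
from functools import reduce
from math import gcd

def reachable_sums(generators, limit):
    """reachable[t] is True when t is a sum of one or more generators (t >= 1)."""
    reachable = [False] * (limit + 1)
    reachable[0] = True                      # empty sum, used only as a base case
    for total in range(1, limit + 1):
        reachable[total] = any(total >= g and reachable[total - g]
                               for g in generators)
    return reachable

def period_and_threshold(generators, limit=1000):
    """Return (d, t): d is the highest common factor of the generators and
    every multiple of d from t up to limit lies in the additive class."""
    d = reduce(gcd, generators)
    reachable = reachable_sums(generators, limit)
    missing = [m for m in range(d, limit + 1, d) if not reachable[m]]
    threshold = missing[-1] + d if missing else d
    return d, threshold
```

For instance, `period_and_threshold([6, 10, 15])` reports highest common factor 1 with every integer from 30 onward representable, while `period_and_threshold([4, 6])` reports highest common factor 2 with every even integer from 4 onward representable.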
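Suppose now that $d = 1$; the general case will be reduced to this one.
Let $C$ have the properties described in Lemma 5.4. By Lemma 5.1 there
is a $\delta > 0$ and an integer $\mu$ for which

$$p_\mu(\xi, C) \ge \delta, \qquad \xi \in X.$$

Then, since

$$p_\mu(\xi, C) \le \sum_{m=1}^{\mu} p^{(m)}(\xi, C),$$

there is a $\beta = \beta(\xi) \le \mu$ for which $p^{(\beta)}(\xi, C) \ge \delta/\mu$. The hypothesis
$d = 1$ means that $I(C)$ contains all sufficiently large integers, say all $\ge N$.
It follows that, if $\eta \in C$ and $\xi$ is arbitrary,

$$p_0^{(N+\mu)}(\xi, \eta) \ge \int_C p_0^{(N+\mu-\beta)}(\zeta, \eta)\, p^{(\beta)}(\xi, d\zeta)
  \ge \frac{\delta}{\mu} \operatorname*{G.L.B.}_{\zeta \in C}\, p_0^{(N+\mu-\beta)}(\zeta, \eta)
  \ge \frac{\delta}{\mu} \min_{N \le n \le N+\mu-1} \operatorname*{G.L.B.}_{\zeta, \eta \in C}\, p_0^{(n)}(\zeta, \eta) > 0,$$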

and this is precisely the hypothesis of Case (b) with the $\nu$ of that hypothesis
identified with $N + \mu$ here. Then, according to the discussion under
Case (b), there is a stationary absolute probability distribution $\pi$ such that

$$\lim_{n \to \infty} p^{(n)}(\xi, E) = \pi(E)$$

exists uniformly in $\xi$ and $E$. The limit is approached with exponential
speed. Moreover, if $C_1 \subset C$ and if $\varphi(C_1) > 0$, then $\pi(C_1) > 0$. We
shall prove that in the present case $\varphi(E) > 0$ implies that $\pi(E) > 0$ for
all $E$ (in $\mathcal{F}_X$). In fact (stationarity)

$$\pi(E) = \int_X p^{(n)}(\xi, E)\,\pi(d\xi) = \frac{1}{n} \sum_{m=1}^{n} \int_X p^{(m)}(\xi, E)\,\pi(d\xi)
  \ge \frac{1}{n} \int_X p_n(\xi, E)\,\pi(d\xi).$$

Now in Case (c), if $\varphi(E) > 0$, $p_n(\xi, E)$ is positive for large $n$ (and in fact
converges uniformly to 1 when $n \to \infty$, by Lemma 5.1). Then $\pi(E) > 0$
if $\varphi(E) > 0$.
Now suppose that $d > 1$, and define $C_j$, $j = 1, \cdots, d$, as the $\xi$ set for
which $p^{(nd-j)}(\xi, C) > 0$, for some integer $n \ge 1$. Then $X = \bigcup_{j=1}^{d} C_j$, by
the hypothesis of Case (c). The $C_j$'s are not necessarily disjunct.

LEMMA 5.5 For all $n$, under Conditions (D) and (2),

$$p^{(n)}(\xi, C_j C_k) = 0, \qquad j \ne k,$$

except perhaps on a $\xi$ set of $\varphi$ measure 0, and

$$\varphi(C_j C_k) = 0.$$

In fact, if $p^{(\alpha)}(\xi, C_j C_k) > 0$ on a set of positive $\varphi$ measure, there are
integers $m$ and $n$ for which the inequalities

$$p^{(\alpha + md - j)}(\xi, C) > 0, \qquad p^{(\alpha + nd - k)}(\xi, C) > 0$$

are simultaneously true on a set of positive $\varphi$ measure and (decreasing the
latter set slightly to another set $E$ of positive $\varphi$ measure) for which both
inequalities are true with 0 replaced on the right-hand sides by a positive
number. By the hypothesis of Case (c) there is a $C' \subset C$ with $\varphi(C') > 0$
and a $\beta$, such that

$$\operatorname*{G.L.B.}_{\xi \in C'}\, p^{(\beta)}(\xi, E) > 0.$$

Hence if $i \in I(C)$ the transitions $C$ to $C'$ to $E$ to $C$ to $C$ show that

$$i + \beta + \alpha + md - j + i \in I(C),$$
$$i + \beta + \alpha + nd - k + i \in I(C).$$

Then $d$ is a factor of both left-hand sums, and therefore of their difference,
hence of $|k - j|$. This implies that $k = j$, and finishes the proof of the first part
of the lemma. To prove the second part we note that if $\varphi(C_j C_k) > 0$
there is an $n$ for which $p^{(n)}(\xi, C_j C_k) > 0$ on a $\xi$ set of positive $\varphi$ measure
by the hypothesis of Case (c), which means, by the first part of the lemma,
that $j = k$.

We shall use the following reduction procedure below. Suppose that
$F \in \mathcal{F}_X$ and that $p(\xi, F) = 0$ if $\xi \in X - F$. Then if $X - F$ is not empty
the reduced function $p(\cdot, \cdot)$ of $\xi$ and $E$, considered only for $\xi \in X - F$ and
$E \subset X - F$, is a stochastic transition function with $X$ replaced by $X - F$.
If a point is ever in $X - F$ it stays there, so that

$$p^{(n+1)}(\xi, X - F) \ge p^{(n)}(\xi, X - F), \qquad \xi \in X.$$

For example, if $F_0 \in \mathcal{F}_X$ and if $F_n$ is the $\xi$ set defined by the condition
$p^{(n)}(\xi, F_0) > 0$, then $F = \bigcup_{n=0}^{\infty} F_n$ has the property demanded of $F$ above.

Finally we remark that in every case, if Condition (D) is satisfied by
$p(\xi, E)$, it is also satisfied by the reduced $p(\xi, E)$, and in particular if
$\varphi(F) = 0$,

$$\lim_{n \to \infty} p^{(n)}(\xi, F) = 0$$

uniformly in $\xi \in X$. The first part of this statement is obvious; the
second follows from Lemma 5.1 since [Condition (D)]

$$p^{(n+\nu)}(\xi, X - F) \ge p^{(\nu)}(\xi, X - F) \ge \varepsilon, \qquad \xi \in X,$$

if $\varphi(F) = 0$. The convergence to 0 will be exponentially fast, according
to Lemma 5.1.




Now, if the $C_j$'s of Lemma 5.5 were actually disjunct, we should have

$$p(\xi, C_{j+1}) = 1, \qquad \xi \in C_j$$

(interpreting $C_{d+1}$ as $C_1$). The system would run cyclically through states
in $C_1, C_2, \cdots, C_d, C_1, \cdots$. Moreover, the function $p^{(d)}(\xi, E)$, for
$\xi \in C_j$, $E \subset C_j$, would be a stochastic transition function with the
properties the original $p(\xi, E)$ would have in case $d = 1$. It would then
follow from the preceding work that

$$\lim_{n \to \infty} p^{(nd)}(\xi, E) \qquad (\xi \in C_j)$$

exists uniformly in $\xi$ and $E$. Although in general the $C_j$'s are not disjunct,
we can apply the reduction procedure described above to obtain a reduced
stochastic transition function for which they are disjunct. We delete
from $X$ the set $F_0 = \bigcup_{j \ne k} C_j C_k$ augmented by the $F_n$'s as described in the
reduction procedure. According to Lemma 5.5, $\varphi(F_0) = 0$. If $\varphi(F_n) > 0$
for some $n$, there would be positive probability that a point in $X - F_0$
would go into $F_n$ in some number (say $m$) of transitions, according to the
hypothesis of Case (c). Then that point in $X - F_0$ would go into $F_0$ in
$m + n$ transitions, with positive probability, contradicting Lemma 5.5.
Thus $\varphi(F_n) = 0$ for all $n$, and it follows that $\varphi(F) = 0$. For the reduced
stochastic transition function the $C_j$'s become the disjunct sets

$$C_1', \cdots, C_d', \qquad C_j' = C_j(X - F).$$


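Then the remarks made above relevant to disjunct $C_j$'s apply. Taking
into account the fact (proved in discussing the reduction procedure) that

$$\lim_{n \to \infty} p^{(n)}(\xi, F) = 0, \qquad \xi \in X,$$

uniformly (exponentially fast) in $\xi$, we find that

$$\lim_{n \to \infty} p^{(nd)}(\xi, E) = \pi_j(E), \qquad \xi \in C_j,$$

uniformly in $\xi$ and $E \in \mathcal{F}_X$. Here $\pi_j$ is a probability measure of sets $E$,
with

$$\pi_j(C_j) = 1, \qquad \pi_j(E) > 0 \ \text{ if } \ \varphi(E C_j) > 0.$$

More generally, as in §2,

$$p^{(nd+m)}(\xi, E) = p^{(nd+m)}(\xi, E C_\beta), \qquad \xi \in C_\alpha, \quad \beta \equiv \alpha + m \pmod{d},$$

$$\lim_{n \to \infty} p^{(nd+m)}(\xi, E) = \pi_\beta(E), \qquad \xi \in C_\alpha, \quad \beta \equiv \alpha + m \pmod{d}.$$

The convergence is uniform and exponentially fast. This equation
implies that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p^{(m)}(\xi, E) = \frac{1}{d} \sum_{j=1}^{d} \pi_j(E), \qquad \xi \in \bigcup_{\alpha} C_\alpha.$$

The asymptotic character of $p^{(n)}(\xi, E)$ for $\xi \in F$ will be investigated in the
next (more general) case.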

Case (d) General case under Hypothesis (D). A set $E$ will be called a
consequent set if, for some $\xi_0$, $p^{(n)}(\xi_0, E) = 1$ for all $n$, and in this case $E$
will be called a consequent of $\xi_0$. By (D), if $E$ is a consequent set $\varphi(E) \ge \varepsilon$.
A set which is a consequent of every one of its points will be called an
invariant set. An invariant set is then either empty or has $\varphi$ measure $\ge \varepsilon$.

If $E$ is a consequent of $\xi_0$, and if $F_n$ is the set of points $\xi$ of $E$ for which
$p^{(n)}(\xi, E) < 1$, then $E - F_n$ is a consequent of $\xi_0$, since otherwise
$p^{(m)}(\xi_0, F_n) > 0$ for some $m$, and then

$$1 = p^{(m+n)}(\xi_0, E) = \int_{F_n} p^{(n)}(\xi, E)\, p^{(m)}(\xi_0, d\xi) + \int_{E - F_n} p^{(n)}(\xi, E)\, p^{(m)}(\xi_0, d\xi)$$

$$< p^{(m)}(\xi_0, F_n) + p^{(m)}(\xi_0, E - F_n) = p^{(m)}(\xi_0, E) = 1.$$

Thus if $E$ is a consequent set it contains the non-null invariant set

$$\bigcap_{n} (E - F_n).$$

Now suppose that $E$ is an invariant set. If it contains no (non-null)
invariant set of smaller $\varphi$ measure it will be called minimal. Every
consequent set $E$ contains a non-null minimal invariant set, which can be
obtained as follows. We can assume that $E$ is invariant (decreasing $E$ if
necessary to make this true). If $E$ contains no point $\xi_1$ with consequent
$E_1$, for which $\varphi(E_1) < \varphi(E)$, then $E$ is minimal. If there is such a pair,
$\xi_1$, $E_1$, $E_1$ can be supposed invariant. Repeating the argument, we find
a finite or infinite sequence $E_1, E_2, \cdots$ of invariant sets, with

$$E \supset E_1 \supset E_2 \supset \cdots, \qquad \varphi(E) > \varphi(E_1) > \varphi(E_2) > \cdots.$$

Let $\varphi_n = \mathrm{G.L.B.}\ \varphi(F)$ for $F$ non-null, invariant, and a subset of $E_n$.
Then we can suppose that if $E_n$ is not minimal $E_{n+1}$ is chosen so that
$\varphi(E_{n+1}) \le \varphi_n + 1/n$. If there are only finitely many $E_n$'s, the last is the
desired minimal invariant set; if there are infinitely many, $\bigcap_{n=1}^{\infty} E_n$ is the
desired set, since it is non-null, is invariant, and has $\varphi$ measure $\lim_{n \to \infty} \varphi_n$.

If $E_1$ and $E_2$ are invariant, $E_1 E_2$ must also be invariant. In particular,
if $E_1$ is minimal, $E_1 E_2$ is an invariant subset of a minimal invariant set.
It follows that $\varphi(E_1) = \varphi(E_1 E_2)$ unless $E_1 E_2$ is the null set. Hence (if $E_2$
is also minimal) two minimal invariant sets are either disjunct or differ by
at most a set of $\varphi$ measure 0. Let $E_1, E_2, \cdots$ be an enumeration of
the essentially distinct non-null minimal invariant sets. That is, $E_j E_k
= 0$ $(j \ne k)$, $\varphi(E_j) \ge \varepsilon$, and if $E$ is any non-null minimal invariant set it
differs from some $E_j$ by at most a set of $\varphi$ measure 0. There are at most
$\varphi(X)/\varepsilon$ $E_j$'s.

For every $\xi_0$, $\operatorname*{L.U.B.}_{n}\, p^{(n)}(\xi_0, \bigcup_j E_j) > 0$, or $X - \bigcup_j E_j$ would be a
consequent of $\xi_0$ and would, therefore, contain a non-null minimal invariant
subset which should have appeared among the $E_j$'s. Thus every point of
$X$ will finally enter the $E_j$'s; once in a point stays in, so that

$$p^{(n+1)}(\xi, \bigcup_j E_j) \ge p^{(n)}(\xi, \bigcup_j E_j).$$
J 


THEOREM 5.6 (cf. Theorem 2.3) For every $\xi$,

$$\lim_{n \to \infty} p^{(n)}(\xi, \bigcup_j E_j) = 1,$$

and in fact for some $\rho < 1$

$$1 - p^{(n)}(\xi, \bigcup_j E_j) \le \text{const.}\ \rho^n, \qquad n = 1, 2, \cdots,$$

uniformly in $\xi$. Hence, each $\xi$ will remain (with probability 1) outside the
$E_j$'s only a finite number of times in its transitions.

This theorem follows at once from Lemma 5.1 and the Borel-Cantelli
lemma.
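On a finite state space the minimal invariant sets are the closed communicating classes, and they can be found by a reachability search. The sketch below is an illustration under that finite-state assumption only; the 5-state matrix `P` is hypothetical, and the helper names are not taken from the text. The last function tracks the mass left outside the minimal invariant sets, which Theorem 5.6 says decays geometrically.

```python
def reachable_from(P, start):
    """States that can be reached from start with positive probability."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j, pij in enumerate(P[i]):
            if pij > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def decompose(P):
    """Return (minimal invariant sets, remaining transient states)."""
    n = len(P)
    reach = [reachable_from(P, i) for i in range(n)]
    ergodic = []
    for i in range(n):
        # i lies in a minimal invariant set when everything it reaches leads back to i
        if all(i in reach[j] for j in reach[i]):
            cls = sorted(reach[i])
            if cls not in ergodic:
                ergodic.append(cls)
    covered = {i for cls in ergodic for i in cls}
    return ergodic, [i for i in range(n) if i not in covered]

def transient_mass(P, transient, n):
    """Largest probability, over starting states, of being in `transient` after n steps."""
    size, worst = len(P), 0.0
    for start in range(size):
        dist = [1.0 if i == start else 0.0 for i in range(size)]
        for _ in range(n):
            dist = [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]
        worst = max(worst, sum(dist[j] for j in transient))
    return worst

# Hypothetical chain (assumption): {0, 1} and {2} are invariant, 3 and 4 are transient.
P = [[0.5, 0.5, 0.0, 0.0, 0.0],
     [0.3, 0.7, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0, 0.0],
     [0.2, 0.0, 0.3, 0.25, 0.25],
     [0.0, 0.0, 0.5, 0.5, 0.0]]
```

With this matrix, `decompose(P)` separates the two invariant classes from the transient states, and `transient_mass(P, [3, 4], n)` shrinks roughly like a constant times $(1/2)^n$.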
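For each $a$, $p(\xi, E)$ with $\xi \in E_a$ and $E \subset E_a$ defines a stochastic transition
function for which the hypotheses of Case (c) are satisfied. Hence, if we
suppose for the moment that Condition (2) is satisfied, and if
${}_aC_1, \cdots, {}_aC_{d_a}$ are the cyclically moving classes in $E_a$, as discussed under
Case (c), with $E_a$ reduced as in Case (c), to make $E_a = \bigcup_\alpha {}_aC_\alpha$,

$$p^{(nd_a+m)}(\xi, E) = p^{(nd_a+m)}(\xi, E\,{}_aC_\beta), \qquad \xi \in {}_aC_\alpha, \quad \beta \equiv \alpha + m \pmod{d_a}, \tag{5.9}$$

and

$$\lim_{n \to \infty} p^{(nd_a+m)}(\xi, E) = {}_a\pi_\beta(E), \qquad \xi \in {}_aC_\alpha, \quad \beta \equiv \alpha + m \pmod{d_a}. \tag{5.10}$$

The convergence is uniform and exponentially fast. Here ${}_a\pi_\beta$ is a probability
measure of sets $E$, with

$${}_a\pi_\beta({}_aC_\beta) = 1, \qquad {}_a\pi_\beta(E) > 0 \ \text{ if } \ \varphi(E\,{}_aC_\beta) > 0.$$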




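The conditional probability that the system, initially at $\xi$, will finally be
in $E_a$, is

$$\rho(\xi, E_a) = \lim_{n \to \infty} p^{(n)}(\xi, E_a), \tag{5.11}$$

where $p^{(n)}(\xi, E_a)$ is non-decreasing in $n$ since once the system is in $E_a$ it
remains there. If $\xi \in E_a$, $\rho(\xi, E_a) = 1$ and in general, according to
Theorem 5.6, $\sum_a \rho(\xi, E_a) = 1$. Considering only positions after $d_a, 2d_a, \cdots$
transitions, the conditional probability that the system, initially at $\xi$, will
finally be in ${}_aC_\alpha$ is

$$\rho(\xi, {}_aC_\alpha) = \lim_{n \to \infty} p^{(nd_a)}(\xi, {}_aC_\alpha). \tag{5.12}$$

If $\xi \in {}_aC_\alpha$, $\rho(\xi, {}_aC_\alpha) = 1$. Then

$$\rho(\xi, E_a) = \sum_{\alpha=1}^{d_a} \rho(\xi, {}_aC_\alpha)$$

and, as in §2,

$$\lim_{n \to \infty} p^{(nd_a+m)}(\xi, E E_a) = \sum_{\alpha=1}^{d_a} \rho(\xi, {}_aC_\alpha)\, {}_a\pi_{\alpha+m}(E), \tag{5.13}$$

for all $\xi$, where $\alpha + m$ is to be reduced mod $d_a$ in the subscript. If $\xi$ is
in an ergodic set, this reduces to the limit equation already obtained.
The convergence is again uniform and exponentially fast. Finally this
limit equation implies that, for all $E \in \mathcal{F}_X$,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p^{(m)}(\xi, E)
  = \sum_a \frac{1}{d_a} \sum_{m=1}^{d_a} \sum_{\alpha=1}^{d_a} \rho(\xi, {}_aC_\alpha)\, {}_a\pi_{\alpha+m}(E)
  = \sum_a \sum_{\alpha=1}^{d_a} \rho(\xi, {}_aC_\alpha)\, {}_a\pi(E)
  = \sum_a \rho(\xi, E_a)\, {}_a\pi(E), \tag{5.14}$$

where

$${}_a\pi(E) = \frac{1}{d_a} \sum_{\alpha=1}^{d_a} {}_a\pi_\alpha(E). \tag{5.15}$$

As so defined, ${}_a\pi$ is a probability measure of sets $E$, with

$${}_a\pi(E_a) = 1, \qquad {}_a\pi(E) > 0 \ \text{ if } \ \varphi(E E_a) > 0.$$

These equations hold for all $\xi$ and $E$, and the limits are uniform in $\xi$ and
$E$; the limits are approached exponentially fast where Cesàro sums are
not involved.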
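The limit formula (5.14) can be illustrated on a finite chain by comparing Cesàro averages with the product of absorption probabilities and stationary measures. This is a rough numerical sketch, not a proof: the matrix `P` and its ergodic sets are hypothetical, the averages are truncated at a finite `n`, and all names are assumptions made for the example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cesaro_average(P, n):
    """Return (1/n) * (P^1 + ... + P^n), an approximation of q(xi, {j})."""
    size = len(P)
    power = [row[:] for row in P]
    total = [[0.0] * size for _ in range(size)]
    for _ in range(n):
        for i in range(size):
            for j in range(size):
                total[i][j] += power[i][j]
        power = mat_mul(power, P)
    return [[value / n for value in row] for row in total]

# Hypothetical chain (assumption): ergodic sets {0, 1} and {2}; states 3, 4 transient.
P = [[0.5, 0.5, 0.0, 0.0, 0.0],
     [0.3, 0.7, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0, 0.0],
     [0.2, 0.0, 0.3, 0.25, 0.25],
     [0.0, 0.0, 0.5, 0.5, 0.0]]
ERGODIC_SETS = [[0, 1], [2]]

def absorption_probability(Q, xi, ergodic_set):
    """Approximate rho(xi, E_a) as the limiting mass that xi sends into E_a."""
    return sum(Q[xi][j] for j in ergodic_set)

def limit_from_formula(Q, xi, j):
    """Right side of (5.14) for E = {j}: sum over a of rho(xi, E_a) * a_pi({j})."""
    total = 0.0
    for ergodic_set in ERGODIC_SETS:
        inside = ergodic_set[0]          # any state of E_a yields the measure a_pi
        total += absorption_probability(Q, xi, ergodic_set) * Q[inside][j]
    return total

Q = cesaro_average(P, 2000)
```

For the transient state 3, solving the absorption equations by hand gives $\rho(3, \{0,1\}) = 0.32$, and the stationary measure on $\{0, 1\}$ is $(0.375, 0.625)$, so `Q[3][0]` should be close to $0.32 \times 0.375 = 0.12$.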

The results (5.9)-(5.15) have been based on the validity of Condition (4) 

as well as Condition (D). We now show how to obtain these results 
without Condition (£). We shall call a Borel field F x’ C F x admissible 




if the $\xi$ function $p(\cdot, A)$ is measurable with respect to $\mathcal{F}_X'$ when $A \in \mathcal{F}_X'$,
and if Condition (X), which is stronger than (©), is satisfied for $\mathcal{F}_X'$.
For example, the Borel field containing only the null set and $X$ is admissible.
Then the following statements are true.

(i) If $\{B_n\}$ is a finite or infinite sequence of $\mathcal{F}_X$ sets, there is an admissible
Borel field $\mathcal{F}_X'$ with $\{B_n\} \subset \mathcal{F}_X'$.

(ii) If $\{\mathcal{F}_X^{(n)}\}$ is a finite or infinite sequence of admissible Borel fields,
there is an admissible Borel field $\mathcal{F}_X'$ with $\bigcup_n \mathcal{F}_X^{(n)} \subset \mathcal{F}_X'$.

The first statement implies the second, because, if (i) is true, we can 
take as the B,’s in (ii) all the F x sets in the sequences of sets involved in 
the (X^) condition for F x”, F x, + + +. To prove (i), define F(A) for 
any A e Fx as the denumerable class of & sets of the form 


{p A) <r} rational. 
Then define F x’ as the Borel field of & sets generated by Y,, = Oe rs 
where GY = {B,}, and, if Zo * * *, Y, have already been defined, 


$n =OG,0G,), 9 =Y' FA), 


and the prime on the union symbol means that the union is over all sets 
A which are finite unions of finite intersections of sets which are G p, sets 
or complements of Y,, sets. Then @ „ is denumerable. By definition of 
Fx, {B,}C Fx’. Let Y be the class of F x sets A with the property 
that p(-, A) is a function measurable with respect to. F x’. We show that 
F y’ is admissible by showing that 9D F x’. Class G „ is a field of sets, 
since by definition Y,,,, includes the field generated by Y,. Moreover 
GD F, because, if A € fp pC, A) isa E function measurable with respect 
to the Borel field generated by p+. Finally G obviously includes limits 
of monotone sequences of ¥ sets. Hence (Supplement, Theorem 1.2); 
GD F x’, as was to be proved. 

Since all the results of this section are applicable if $\mathcal{F}_X$ is replaced by
an admissible Borel field, we can use the properties (i), (ii) of these
admissible Borel fields to obtain results for $\mathcal{F}_X$ itself. For example, if
$A \in \mathcal{F}_X$ and if $\mathcal{F}_X$ satisfies Condition (2), we have proved that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p^{(m)}(\xi, A)$$

exists, uniformly in $\xi$ and $A$. If $\mathcal{F}_X$ does not satisfy Condition (£) and
if $A \in \mathcal{F}_X$, replace $\mathcal{F}_X$ by an admissible Borel field containing $A$ to get
that the above limit exists uniformly in $\xi$. To get the full results of this
section, we proceed as follows. We defined ergodic classes $E_1, E_2, \cdots$
without the use of Condition (£). Let $\mathcal{F}_X'$ be any admissible Borel field
of $\xi$ sets, such that $\{E_a\} \subset \mathcal{F}_X'$. There is such an admissible Borel field
by (i). Then, replacing $\mathcal{F}_X$ by $\mathcal{F}_X'$, we find cyclically moving classes
${}_aC_1, \cdots, {}_aC_{d_a}$, $d_a \ge 1$, in $E_a$ relative to $\mathcal{F}_X'$. If $\mathcal{F}_X''$ is another admissible
Borel field, with $\mathcal{F}_X' \subset \mathcal{F}_X''$, the cyclically moving classes relative
to $\mathcal{F}_X''$ are for each $a$ the same ${}_aC_\alpha$'s, or sets obtained from these by a
finer decomposition of $E_a$ with $d_a$ a multiple of its previous value. Since
$d_a \le \varphi(X)/\varepsilon$, there must be for each $a$ an admissible $\mathcal{F}_X'$ maximizing $d_a$.
Then, using (ii), there must be an admissible Borel field $\mathcal{G}_X$ maximizing
$d_a$ for all $a$. Now suppose that $\mathcal{F}_X'$ is any admissible Borel field of $\xi$
sets, with $\mathcal{G}_X \subset \mathcal{F}_X'$. Then the cyclically moving classes relative to $\mathcal{F}_X'$
can be taken the same as those relative to $\mathcal{G}_X$, and all the constants which
govern the speed of convergence will be the same as those relative to
$\mathcal{G}_X$. Since $\mathcal{F}_X'$ can be taken to include any $\mathcal{F}_X$ set, by combining (i)
and (ii), we have proved that all our results, including the uniformity
results, summarized in (5.9)-(5.15), are true even without Condition (©).

If a stochastic transition function $p(\cdot, \cdot)$ satisfies Condition (D) with the
triple $\varphi$, $\nu$, $\varepsilon$, the triple will be called a (D) triple (for the given transition
function). If a stochastic transition function has one (D) triple, it has
infinitely many. This fact, together with the fact that even for a given
(D) triple the decomposition of $X$ obtained above is not uniquely determined,
obscures to some extent the significance of the condition and the
results obtained. The following discussion is given to clarify the choice
of (D) triples, and the decompositions of $X$ obtained.

In the following, if $F$ is a set for which

$$\lim_{n \to \infty} p^{(n)}(\xi, F) = 0, \qquad \xi \in X,$$

the set $F$ will be called a transient set. Suppose that there is a decomposition
of $X$ into disjunct invariant sets $E_1, E_2, \cdots$ and a transient set
$F = X - \bigcup_a E_a$, and that to each $E_a$ corresponds a probability measure
${}_a\pi$ of sets $E \in \mathcal{F}_X$ such that

$${}_a\pi(E_a) = 1, \qquad \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p^{(m)}(\xi, E) = {}_a\pi(E), \qquad \xi \in E_a.$$

Then the $E_a$'s will be called ergodic sets. Suppose furthermore that $E_a$
can be decomposed into $d_a \ge 1$ disjunct sets ${}_aC_1, \cdots, {}_aC_{d_a}$ such that

$$p(\xi, {}_aC_{\alpha+1}) = 1, \qquad \xi \in {}_aC_\alpha, \quad \alpha = 1, \cdots, d_a$$

(where ${}_aC_{d_a+1}$ is interpreted as ${}_aC_1$), and that to each ${}_aC_\alpha$ corresponds a
probability measure ${}_a\pi_\alpha$ of $\mathcal{F}_X$ sets such that

$${}_a\pi_\alpha({}_aC_\alpha) = 1, \qquad \lim_{n \to \infty} p^{(nd_a+m)}(\xi, E) = {}_a\pi_\beta(E), \qquad \xi \in {}_aC_\alpha, \quad \beta \equiv \alpha + m \pmod{d_a}.$$

Then the sets ${}_aC_1, \cdots, {}_aC_{d_a}$ will be called cyclically moving subsets of the
ergodic set $E_a$. If the above decompositions are possible, $\rho(\xi, E_a)$ and
$\rho(\xi, {}_aC_\alpha)$ can be defined as above, and (5.13), (5.14), and (5.15) will be
true. We have proved that these decompositions are possible if Condition
(D) is satisfied. We now show that, if a decomposition into ergodic and
transient sets is possible, the set functions ${}_1\pi, {}_2\pi, \cdots$ are uniquely
determined (aside from order). To see this suppose that

$$E_1, E_2, \cdots, F, \quad {}_1\pi, {}_2\pi, \cdots$$

and

$$E_1', E_2', \cdots, F', \quad {}_1\pi', {}_2\pi', \cdots$$

are two decompositions into ergodic and transient sets with the corresponding
set functions. Clearly no $E_a$ ($E_a'$) can be entirely contained in
$F'$ ($F$), because the ergodic sets are invariant, whereas the points of a
transient set finally go into its complement. Set $E_1$ must therefore have
a non-null intersection with some $E_a'$, say with $E_1'$. Then, if $\xi \in E_1 E_1'$,
(5.14) implies that ${}_1\pi(E) = {}_1\pi'(E)$. Since no two set functions ${}_a\pi'$, ${}_b\pi'$
with $a \ne b$ can be identical, $E_1$ cannot have points in common with any
other $E_a'$. An obvious elaboration of the argument we have given then
shows that the $E_a$'s and $E_a'$'s can be numbered in such a way that


RECIBE 
E,E; #0 
BA GE ae 
am(E)=qr(E), a@=1,2,°°". 
Thus we have proved that the set functions ${}_1\pi, {}_2\pi, \cdots$ are uniquely
determined (aside from order), and that the ergodic sets are uniquely
determined neglecting points in transient sets. A similar argument proves
that if there are cyclically moving sets ${}_aC_1, \cdots, {}_aC_{d_a}$, with corresponding
set functions ${}_a\pi_1, \cdots, {}_a\pi_{d_a}$, these set functions are uniquely determined
(aside from order), and the cyclically moving sets are uniquely determined
(aside from order), points in transient sets being neglected. The number
of ergodic sets in a given decomposition into ergodic and transient sets is
then uniquely determined. If there are $n$ we shall simply say there are $n$
ergodic sets. Similarly we shall say that an ergodic set has $d_a$ cyclically
moving subsets if this is the number in any decomposition of the ergodic
set into such subsets. In the following we shall suppose that a decomposition
into cyclically moving sets and a transient set is possible and that
some ordering is used, so that the set functions $\{{}_a\pi_\alpha\}$, $\{{}_a\pi\}$ are uniquely
determined. The function values $\rho(\xi, E_a)$, $\rho(\xi, {}_aC_\alpha)$ are also uniquely
determined (that is, they depend only on $\xi$ and $a$, and on $\xi$, $a$, and $\alpha$
respectively), as is obvious from their definition.




We can define a decomposition of X into ergodic and transient sets 
using only the set functions 17, o7, * * * by setting E, as the & set where 


$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p^{(m)}(\xi, E) = {}_a\pi(E), \qquad E \in \mathcal{F}_X.$$


(This set E, is the set where p(&, Ea) = 1.) Similarly we can define a 
decomposition into cyclically moving sets by setting aC, as the & set where 


$$\lim_{n \to \infty} p^{(nd_a)}(\xi, E) = {}_a\pi_\alpha(E), \qquad E \in \mathcal{F}_X.$$


(The set ,C, is the set where p(£, «C,) = 1.) If we define 
E, = U Cy 
the sets Ey, En © - -, X- U E, provide still another decomposition into 


ergodic and transient sets. If E, E» * + +, X— U E, is any decomposition 
a 
into ergodic and transient sets, 


EGE eai 


if the ergodic sets are ordered properly. If Cı, «Co, ** * is a further 
decomposition into cyclically moving sets, 


aCa © Cas 


if the cyclically moving sets are ordered properly. 

Now suppose that Condition (D) is satisfied, so that ergodic and
transient sets $\{E_a\}$, $F$, and cyclically moving sets $\{{}_aC_\alpha\}$ exist, as proved
above. Define $\bar\varphi(E)$ by

$$\bar\varphi(E) = \sum_a {}_a\pi(E).$$


Then, assuming proper ordering of the sets involved, 
E GEC EE 
Res UE Re mE 
GF) = 0. 
Now we can write the function $\varphi$ of Condition (D) in the form

$$\varphi(E) = \int_E f(\xi)\,\bar\varphi(d\xi) + \psi(E),$$

where $f(\xi) \ge 0$ and $\psi$, a finite-valued measure, is the singular component
of $\varphi$ with respect to $\bar\varphi$, so that there is a set of $\bar\varphi$ measure 0 on which $\psi$
takes its maximum value, $\psi(X)$. Since ${}_a\pi(E) > 0$ whenever $\varphi(E E_a) > 0$,
$\bar\varphi(E) = 0$ implies that $\varphi(E \bigcup_a E_a) = 0$. Then

$$\psi(\bigcup_a E_a) = 0.$$

In other words, the $\psi$ measure is confined to $F$. Since

$$p^{(nd_a)}(\xi, {}_aC_\alpha) = 1, \qquad \xi \in {}_aC_\alpha,$$

Condition (D) implies that $\varphi({}_aC_\alpha) \ge \varepsilon$. Then

$$\int_{{}_aC_\alpha} f(\xi)\,\bar\varphi(d\xi) \ge \varepsilon,$$

and hence $f(\xi)$ must be $> 0$ on a set of positive $\bar\varphi$ measure on each
${}_aC_\alpha$. Conversely, if Condition (D) is satisfied, and if $f_1$ is any
non-negative function measurable with respect to $\mathcal{F}_X$, and $> 0$ on a
subset of each ${}_aC_\alpha$ of positive $\bar\varphi$ measure, and if $\psi_1$ is any finite-valued
measure of sets $E \in \mathcal{F}_X$ which is singular with respect to $\bar\varphi$ measure,
define $\varphi_1(E)$ by

$$\varphi_1(E) = \int_E f_1(\xi)\,\bar\varphi(d\xi) + \psi_1(E).$$

Then it is easy to verify that Condition (D) is satisfied with $\varphi_1$, $\nu_1$, $\varepsilon_1$,
for some $\nu_1$, $\varepsilon_1$. We have thus characterized the class of $\varphi$'s occurring in
(D) triples [assuming that there is at least one (D) triple]. The simplest
choice of $\varphi$ is $\bar\varphi$ itself, obtained by setting $f_1 = 1$, $\psi_1 = 0$. With this
choice a far stronger condition than (D) is satisfied. It is easily shown,
using the intimate relation between $\bar\varphi$ and the transition probabilities, that
given any $\varepsilon_1 > 0$ there is an $\varepsilon_2 > 0$ and an integer $\nu$ such that if $\bar\varphi(E) < \varepsilon_2$
then $p^{(\nu)}(\xi, E) < \varepsilon_1$ for all $\xi$. (If $\varepsilon_1$ is very small, $\nu$ will be very large.)
In practice, of course, the existence of $\bar\varphi$ is not known until it has been
verified that there is some (D) triple, and the $\varphi$ of this (D) triple will not
usually be $\bar\varphi$. However, this remark shows that Condition (D) is a
posteriori equivalent to much stronger conditions.

The preceding discussion implies that if the E,{’s and ,C,'’s are 
choices of the ergodic and cyclically moving sets determined by a sto- 
chastic transition function satisfying Condition (D), for i= 1, 2, then 


(with proper ordering of the sets) 


GEO E,D E=. 
i= 1,2. 


Paea S ENY TERSA) =0 




Moreover, if Pı, 1» & is a (D) triple for this stochastic transition function 
and if the E,”’s are minimal invariant sets for this (D) triple, as defined 
above, (Es) — E,” E,) =0 
GalaCa? — aC Ca) = 0. 


Then, if the £,{®”’s are minimal invariant sets for the (D) triple Pa vo &2 
E,™ differs from £,' (and „C, from aC,®) by at most the union of a 
set of p, measure 0 and a set of p, measure 0. 

The stationary absolute probability distributions are fully described in 
the following theorem. 

THEOREM 5.7 Under Condition (D), $q(\xi, E)$ [defined as the limit in
(5.14)] defines for each $\xi$ a stationary absolute probability distribution; if
$\xi \in E_a$, $q(\xi, E)$ is independent of $\xi$: $q(\xi, E) = {}_a\pi(E)$. Conversely, every
stationary absolute probability distribution is a linear combination (non-negative
coefficients with sum 1) of the ${}_a\pi$'s. More generally, every
solution of

$$\int_X p(\xi, E)\,\psi(d\xi) = \psi(E)$$

($\psi$ finite-valued and completely additive) is a linear combination of the ${}_a\pi$'s.

The proof is exactly the same as in the stochastic matrix case (Theorem
2.4) and will be omitted. This theorem implies that the number of ergodic
sets depends only on the given stochastic transition function, not on the
(D) triple involved in Condition (D). We have already derived this fact
above. It can be shown that the stationary absolute probability distributions
form a convex set in a suitably defined linear space, and that
${}_1\pi, {}_2\pi, \cdots$ are the vertices of this set. This gives a characterization of
the stationary absolute probability distributions independent of particular
(D) triples.
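The convexity statement can be checked directly in the finite case: every convex combination of the stationary measures of the ergodic sets is again stationary, while a distribution charging a transient state is not. The sketch below assumes a hypothetical 5-state matrix `P` with ergodic sets $\{0, 1\}$ and $\{2\}$; `PI_1`, `PI_2` and the function names are illustrative only.

```python
def apply_left(mu, P):
    """One step of the chain on a distribution: (mu P)(j) = sum_i mu(i) p(i, j)."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def is_stationary(mu, P, tol=1e-12):
    """True when mu P agrees with mu up to rounding."""
    return all(abs(x - y) <= tol for x, y in zip(apply_left(mu, P), mu))

# Hypothetical chain (assumption): ergodic sets {0, 1} and {2}; states 3, 4 transient.
P = [[0.5, 0.5, 0.0, 0.0, 0.0],
     [0.3, 0.7, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0, 0.0],
     [0.2, 0.0, 0.3, 0.25, 0.25],
     [0.0, 0.0, 0.5, 0.5, 0.0]]

PI_1 = [0.375, 0.625, 0.0, 0.0, 0.0]   # stationary measure carried by {0, 1}
PI_2 = [0.0, 0.0, 1.0, 0.0, 0.0]       # stationary measure carried by {2}

def mixture(c):
    """Convex combination c * PI_1 + (1 - c) * PI_2, with 0 <= c <= 1."""
    return [c * a + (1.0 - c) * b for a, b in zip(PI_1, PI_2)]
```

Here `is_stationary(mixture(c), P)` holds for every `c` between 0 and 1, and the two vertices are recovered at `c = 1` and `c = 0`.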

The important classes of stochastic transition functions satisfying
Condition (D) can be treated in exactly the same way as stochastic
matrices (see §2), and the discussion will be given without proofs. It is
supposed throughout that there is a (D) triple $\varphi$, $\nu$, $\varepsilon$, and that ergodic
sets, a transient set, and cyclically moving sets (if any) have been chosen
in some definite way.

Case (e) The limit $q(\xi, E)$ defined by (5.14) is independent of $\xi$ if and
only if there is only a single ergodic set.

Case (f) The limit $q(\xi, E)$ exists as an ordinary (rather than Cesàro)
limit if and only if there are no cyclically moving subsets in any ergodic
set ($d_a = 1$).

Case (g) The limit $q(\xi, E)$ is positive for all $\xi$ whenever $\varphi(E) > 0$ if and
only if the transient set has $\varphi$ measure 0 and there is only one ergodic set.




Case (h) Under Condition (©) if for every $\delta > 0$ there are sets $S_1, S_2, \cdots$
in $\mathcal{F}_X$ with $\bigcup_j S_j = X$, $\varphi(S_j) < \delta$, and $p_0(\xi, \eta) > 0$ for every $\xi, \eta \in S_j$, for
each $j$, then there can be no cyclically moving subsets of ergodic sets.
This will be true, for example, if $X$ is a Euclidean space, $\mathcal{F}_X$ the field of
Borel sets, and if $p_0(\cdot, \cdot)$ is continuous with $p_0(\xi, \xi) > 0$ for all $\xi$.

Case (i) Under Condition (D) if $p(\xi, \cdot)$ is absolutely continuous with
respect to $\varphi(\cdot)$, with density $p_0(\xi, \cdot)$, and if $p_0(\xi, \eta) = p_0(\eta, \xi)$, then
$p^{(n)}(\xi, \cdot)$ will also have a symmetric density function. There will then be
no cyclically moving subsets of ergodic sets, and the transient set will have
$\varphi$ measure 0. The limit $q(\xi, E)$ will be determined by the symmetric
density $q_0(\cdot, \cdot)$ (cf. the discussion below of Examples 2 and 3), which
satisfies the equations

$$q_0(\xi, \eta) = \begin{cases} \dfrac{1}{\varphi(E_a)}, & \xi \in E_a,\ \eta \in E_a, \\ 0, & \xi \in E_a,\ \eta \notin E_a, \\ 0, & \eta \in F. \end{cases}$$

In particular, if there is only one ergodic set,

$$q_0(\xi, \eta) = \begin{cases} \dfrac{1}{\varphi(X)}, & \eta \notin F, \\ 0, & \eta \in F. \end{cases}$$


Case (j) If

$$\int_X p(\xi, E)\,\varphi(d\xi) = \varphi(E), \qquad E \in \mathscr{F}_X$$

(that is, if $\varphi(\cdot)/\varphi(X)$ is a stationary absolute probability distribution), then
$p^{(n)}(\cdot, \cdot)$ and therefore $q(\cdot, \cdot)$ satisfy this same equation. Hence (substituting
$F$ for $E$) the transient set $F$ must have $\varphi$ measure 0. Moreover,
since $q(\xi, E) = {}_a\pi(E)$ if $\xi \in E_a$,

$$q(\xi, E) = {}_a\pi(E) = \frac{\varphi(E E_a)}{\varphi(E_a)}, \qquad \xi \in E_a.$$

Thus, as in Case (i), $q(\xi, E)$ is given by a constant density $1/\varphi(E_a)$ for
$\xi \in E_a$. Since there may be cyclically moving classes in Case (j) [although
they cannot be present in Case (i)], the Cesàro limit in (5.14) cannot be
replaced by an ordinary limit.

Examples 2,3 (continued) Suppose that Condition (2) is satisfied and 
that the stochastic transition function has the property that 


$$p^{(\nu)}(\xi, E) \le \varphi(E)$$


216 MARKOV PROCESSES—DISCRETE PARAMETER v 


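for all $\xi$ and $E \in \mathscr{F}_X$, where $\nu$ is some integer and $\varphi$ is some finite-valued
measure. Then we have seen that Condition (D) is satisfied with the
given $\varphi$ and $\nu$ if $\varepsilon < \tfrac{1}{2}$. The transition function $p^{(\nu)}(\xi, \cdot)$ has a density
function,

$$p^{(\nu)}(\xi, E) = \int_E p_0^{(\nu)}(\xi, \eta)\,\varphi(d\eta),$$

and, more generally, $p_0^{(n)}(\xi, \cdot)$, the density of $p^{(n)}(\xi, \cdot)$, is defined for
$n > \nu$ by

$$p_0^{(n)}(\xi, \eta) = \int_X p^{(n-\nu)}(\xi, d\zeta)\,p_0^{(\nu)}(\zeta, \eta).$$

We observe that, although $p_0^{(\nu)}(\cdot, \cdot)$ is not uniquely determined by $p^{(\nu)}(\cdot, \cdot)$,
$p_0^{(\nu)}(\xi, \cdot)$ is determined up to an $\eta$ set of $\varphi$ measure 0, for each $\xi$. We can
assume that $p_0^{(\nu)}(\cdot, \cdot)$ satisfies the conditions

$$\begin{aligned} 0 \le p_0^{(\nu)}(\xi, \eta) &\le 1, \\ p_0^{(\nu)}(\xi, \eta) &= 0 \qquad (\xi \in E_a,\ \eta \notin E_a), \\ p_0^{(\nu)}(\xi, \eta) &= 0 \qquad (\xi \in {}_aC_j,\ \eta \notin {}_aC_{j+\nu}). \end{aligned}$$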


If $p_0^{(\nu)}(\xi, \eta)$ is chosen to satisfy these conditions, $p_0^{(n)}(\xi, \eta)$, uniquely
defined as above, satisfies the same conditions with $\nu$ replaced by $n > \nu$.
[As usual the forward subscript of ${}_aC_j$ is to be interpreted (mod $d_a$).]
Since $p^{(n)}(\xi, E) = 0$ if $n \ge \nu$ whenever $\varphi(E) = 0$, the limits $q(\xi, E)$,
${}_a\pi(E)$, ${}_a\pi_j(E)$ must also vanish whenever $\varphi(E) = 0$, so that ${}_a\pi$ and ${}_a\pi_j$ are
absolutely continuous with respect to $\varphi$, and they determine density
functions. We shall show that the limit theorems proved above for the
set functions imply the obvious corresponding limit theorems for the
densities.
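For example, since

$$p^{(n)}(\xi, F) \le \gamma\rho^n$$

for some constants $\gamma$, $\rho$, with $0 < \rho < 1$, uniformly in $\xi$ and $n$, it follows
that

$$p_0^{(n)}(\xi, \eta) \le \gamma\rho^{n-\nu}, \qquad \eta \in F.$$

In fact,

$$p_0^{(n)}(\xi, \eta) = \int_F p^{(n-\nu)}(\xi, d\zeta)\,p_0^{(\nu)}(\zeta, \eta) \le p^{(n-\nu)}(\xi, F) \le \gamma\rho^{n-\nu}, \qquad \eta \in F.$$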


The other limit theorems go over in the same way; the densities approach 
the limit densities uniformly with the same speed as the distributions 
approach their limits. To show the principle involved, we take the case 




in which there is only one ergodic set, which has no cyclically moving 
subsets, and prove that, from the inequality we have already proved in 
this case 

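$$|p^{(n)}(\xi, E) - p(E)| \le \gamma\rho^n,$$

where $p = {}_1\pi$ is the (unique) stationary absolute probability distribution,
$\gamma > 0$, and $0 < \rho < 1$, it follows that

$$|p_0^{(n)}(\xi, \eta) - p_0(\eta)| \le 2\gamma\rho^{n-\nu}$$

for all $\xi$, $\eta$. Here $p_0(\cdot)$ is the density of the $p(\cdot)$ distribution,

$$\int_E p_0(\eta)\,\varphi(d\eta) = p(E)$$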


for all $E \in \mathscr{F}_X$. The condition that $p_0(\cdot)$ be the $p(\cdot)$ density only fixes it
up to a set of measure 0. The function is defined uniquely as follows:
Since $p(\cdot)$ is a stationary absolute probability distribution,

$$\int_X p^{(n)}(\xi, E)\,p(d\xi) = p(E),$$

and this implies that

$$\int_X p_0^{(n)}(\xi, \eta)\,p(d\xi) = p_0(\eta), \qquad n \ge \nu,$$

for almost all $\eta$ ($\varphi$ measure). We can, therefore, define $p_0(\cdot)$ uniquely as
the left side of this equation for $n = \nu$. It follows that the equation is
true for all $n \ge \nu$. With this definition we have

$$|p_0^{(n)}(\xi, \eta) - p_0(\eta)| = \left|\int_X p_0^{(\nu)}(\zeta, \eta)\,[p^{(n-\nu)}(\xi, d\zeta) - p(d\zeta)]\right| \le 2\gamma\rho^{n-\nu},$$

as was to be proved. If the above conventions had not been made on
$p_0^{(n)}(\xi, \eta)$ and $p_0(\eta)$, these results would hold for each $\xi$ up to an $\eta$ set
of measure 0.

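Finally we discuss the implications of the stronger of the two conditions
of Example 3,

$$p^{(\nu)}(\xi, E) \le K p^{(\nu)}(\eta, E)$$

for some integer $\nu$ and constant $K$; the condition is to hold for all
$\xi$, $\eta$, $E \in \mathscr{F}_X$. It implies

$$\frac{1}{K}\,p^{(\nu)}(\eta, E) \le p^{(\nu)}(\xi, E) \le K p^{(\nu)}(\eta, E).$$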




Hence, if we define $\varphi(E) = p^{(\nu)}(\eta, E)$ for some $\eta$, we see that the previous
discussion is applicable and moreover that the density $p_0^{(\nu)}(\xi, \eta)$ can be
defined to satisfy

$$\frac{1}{K} \le p_0^{(\nu)}(\xi, \eta) \le K.$$


Under this condition then the hypotheses of Case (b) are satisfied. (It is 
actually obvious from the original condition, without the introduction of 
densities, that there is only one ergodic class and that there are no 
cyclically moving subclasses.) This case was first discussed by Kolmo- 
gorov, who proved the conclusions drawn in Case (b), using essentially 
the proof given (without the introduction of densities, which is quite 
superfluous in this case). 

Finally we give a simple example in which some of the conclusions 
drawn in this section hold, but in which Condition (D) is not satisfied. 

Example 4 Let $x_1, x_2, \cdots$ be real-valued random variables constituting
a Gaussian process with

$$E\{x_n\} = 0, \qquad E\{x_m x_{m+n}\} = \rho^n.$$

Here $\rho$ is a real parameter, $0 < \rho < 1$. The $x_n$ process is a stationary
Markov process (see §8 and X, §4). The conditional distribution of
$x_{m+n}$ for given $x_m$ is Gaussian with expectation $\rho^n x_m$ and variance $1 - \rho^{2n}$;
that is,

$$p^{(n)}(\xi, E) = [2\pi(1 - \rho^{2n})]^{-1/2} \int_E \exp\left[-\frac{(\eta - \rho^n \xi)^2}{2(1 - \rho^{2n})}\right] d\eta.$$

Then

$$\lim_{n\to\infty} p^{(n)}(\xi, E) = \frac{1}{\sqrt{2\pi}} \int_E e^{-\eta^2/2}\,d\eta.$$

If $E$ is a finite interval this convergence is not uniform in $\xi$, because, for
each $n$, $p^{(n)}(\xi, E)$ can be made arbitrarily small by making $|\xi|$ large. Thus,
simple as this case is, (D) is not satisfied. The iterated stochastic transi- 
tion functions converge (but not uniformly) to a stationary absolute 
probability distribution function. 
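
The failure of uniformity in Example 4 can be checked numerically. The following is a minimal sketch, not part of the original argument: the function names and the particular values $\rho = 0.5$, $E = [-1, 1]$ are illustrative assumptions. It evaluates $p^{(n)}(\xi, E)$ from the Gaussian formula above and compares a fixed starting point with one that is moved out as $n$ grows.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def transition_prob(n: int, xi: float, lo: float, hi: float, rho: float) -> float:
    """p^(n)(xi, E) for E = [lo, hi]: mass of E under N(rho^n * xi, 1 - rho^(2n))."""
    mean = rho ** n * xi
    sd = math.sqrt(1.0 - rho ** (2 * n))
    return normal_cdf((hi - mean) / sd) - normal_cdf((lo - mean) / sd)


def limit_prob(lo: float, hi: float) -> float:
    """Limit of p^(n)(xi, E) as n grows: the standard normal mass of E."""
    return normal_cdf(hi) - normal_cdf(lo)


if __name__ == "__main__":
    rho, lo, hi = 0.5, -1.0, 1.0  # illustrative values, not from the text
    for n in (1, 5, 10):
        # Fixed xi: the gap to the limit shrinks as n grows.
        near = abs(transition_prob(n, 1.0, lo, hi, rho) - limit_prob(lo, hi))
        # xi moved out with n (xi = 10 / rho^n): the gap does not shrink.
        far = abs(transition_prob(n, 10.0 / rho ** n, lo, hi, rho) - limit_prob(lo, hi))
        print(n, near, far)
```

For each fixed $\xi$ the first gap tends to 0, while the second stays near the full limit value, which is the non-uniformity used above to conclude that (D) fails.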


6. The law of large numbers 

In this section we shall prove a version of the law of large numbers 
applicable to the processes studied in §5. We suppose first only that 
$p(\cdot, \cdot)$ is a stochastic transition function as defined in §5, and that
$x_1, x_2, \cdots$ are random variables (not necessarily numerically valued)




constituting a Markov process with the given $p(\cdot, \cdot)$ as transition
probability distribution (from $x_n$ to $x_{n+1}$).

THEOREM 6.1 If the $x_1$ distribution $p(\cdot)$ is a stationary absolute
probability distribution, and if $f$ is a function of $\xi \in X$, measurable with respect
to $\mathscr{F}_X$, with

$$E\{|f(x_1)|\} = \int_X |f(\xi)|\,p(d\xi) < \infty,$$

then

$$\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} f(x_m) \tag{6.1}$$

exists with probability 1. In particular, under Hypothesis (D), if there is
only one ergodic set,

$$\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} f(x_m) = \int_X f(\xi)\,p(d\xi) = E\{f(x_1)\} \tag{6.1'}$$

with probability 1.

The first statement of the theorem is simply an application of the 
strong law of large numbers for strictly stationary processes (ergodic 
theorem) to be proved in X (X, Theorem 2.1). It is remarked in X, §1, 
that if the hypotheses of the second part of Theorem 6.1 are satisfied the 
process is metrically transitive, and in that case the strong law of large 
numbers prescribes the limit in (6.1). 

As a simple application suppose that the process is a Markov chain 
with a finite number of states, as discussed in §2, and suppose that there 
is only one ergodic class, with no cyclically moving subclasses, According 
to §2 there is then one and only one set of stationary absolute probabil- 
ities; suppose that these are used to determine a stationary process. 
Then, if $j_n$ is the number of the first $n$ states of the system which are $j$,

$$\lim_{n\to\infty} \frac{j_n}{n} = p_j = P\{x_1(\omega) = j\}$$

with probability 1. In fact, if $f(\xi)$ is defined as 1 if $\xi = j$ and 0 otherwise,

$$\frac{1}{n}\sum_{m=1}^{n} f(x_m) = \frac{j_n}{n},$$
and Theorem 6.1 therefore implies the desired statement. The extension 
of this result to general state spaces is obvious. 
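
The finite-chain application can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the three-state matrix, the step count, and the function names are hypothetical, and the stationary probabilities are approximated by repeated multiplication rather than solved exactly. The initial state is drawn from the stationary probabilities, as Theorem 6.1 requires.

```python
import random


def stationary_distribution(P, iterations=2000):
    """Approximate the row vector p with p = pP by repeated multiplication."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p


def occupation_fractions(P, initial, steps, seed=0):
    """Fractions j_n / n of the first `steps` states equal to each state j."""
    rng = random.Random(seed)
    states = range(len(P))
    x = rng.choices(states, weights=initial)[0]  # initial state from `initial`
    counts = [0] * len(P)
    for _ in range(steps):
        counts[x] += 1
        x = rng.choices(states, weights=P[x])[0]  # one transition
    return [c / steps for c in counts]


if __name__ == "__main__":
    # Hypothetical chain: one ergodic class, no cyclically moving subclasses.
    P = [[0.5, 0.3, 0.2],
         [0.2, 0.6, 0.2],
         [0.1, 0.3, 0.6]]
    p = stationary_distribution(P)
    print(p)
    print(occupation_fractions(P, p, 100_000))
```

With a long run the two printed vectors should be close, in line with $j_n/n \to p_j$.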

If Condition (D) is satisfied, and if there is more than one ergodic class, 
the most general stationary absolute probability distribution has the form 




$\sum_a q_a\,{}_a\pi$, where $0 \le q_a$, $\sum_a q_a = 1$, and we use here and in the following
the notation of §5. The limit in (6.1) under these conditions is clearly

$$\int_{E_a} f(\xi)\,{}_a\pi(d\xi) = \int_X f(\xi)\,{}_a\pi(d\xi), \qquad x_1(\omega) \in E_a,$$

with probability 1. We observe that

$$P\{x_1(\omega) \in F = X - \bigcup_a E_a\} = 0.$$

The limit may thus depend on $x_1(\omega)$ and will, in general, if there is more
than one ergodic set.
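
A small finite example shows how the limit depends on the ergodic class of the starting point. This is a sketch under assumptions: the four-state matrix (two ergodic classes, the second one cyclically moving with period 2), the function $f$, and all names are hypothetical. It forms the Cesàro averages $\frac{1}{n}\sum_{m=1}^{n} P^m$, whose rows approximate ${}_a\pi$ for starting points in $E_a$, and then evaluates $\int f\,d\,{}_a\pi$.

```python
def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def cesaro_average(P, n):
    """(1/n) * (P + P^2 + ... + P^n)."""
    size = len(P)
    total = [[0.0] * size for _ in range(size)]
    power = [row[:] for row in P]
    for _ in range(n):
        for i in range(size):
            for j in range(size):
                total[i][j] += power[i][j]
        power = mat_mul(power, P)
    return [[v / n for v in row] for row in total]


def limit_of_time_average(q_row, f):
    """Integral of f with respect to the distribution q_row."""
    return sum(q * v for q, v in zip(q_row, f))


if __name__ == "__main__":
    # Hypothetical chain: classes {0, 1} and {2, 3}; the second has period 2.
    P = [[0.5, 0.5, 0.0, 0.0],
         [0.2, 0.8, 0.0, 0.0],
         [0.0, 0.0, 0.0, 1.0],
         [0.0, 0.0, 1.0, 0.0]]
    f = [1.0, 0.0, 10.0, 20.0]
    q = cesaro_average(P, 2000)
    for start in range(4):
        print(start, limit_of_time_average(q[start], f))
```

Starting points in the first class give a value near $2/7$, those in the second class give 15; the ordinary limit of $P^n$ does not exist on the second class, which is why the Cesàro average is used.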
Probabilities were defined in Theorem 6.1 and in the extension just
made by the given stochastic transition function, together with a stationary
initial distribution of $x_1$. More generally, we shall now allow any initial
distribution $\pi$ of $\mathscr{F}_X$ sets. The relevant expectations and probabilities
will be denoted by $E_\pi\{\,\cdot\,\}$, $P_\pi\{\,\cdot\,\}$ to stress the dependence on the initial
distribution $\pi$. The most important initial distributions are the stationary
ones and those concentrated at a single point. In the latter case, if the
point is $\xi$, expectations and probabilities become the conditional
expectations and probabilities $E\{\,\cdot \mid x_1(\omega) = \xi\}$, $P\{\,\cdot \mid x_1(\omega) = \xi\}$.
THEOREM 6.2 Let $f$ be a function of $\xi \in X$, measurable with respect to
$\mathscr{F}_X$, with

$$\int_{E_a} |f(\xi)|\,{}_a\pi(d\xi) < \infty, \qquad a = 1, 2, \cdots.$$

Then, under Hypothesis (D), for any initial distribution of probabilities,

$$\lim_{n\to\infty} \frac{1}{n}\sum_{m=1}^{n} f(x_m) \tag{6.2}$$

exists with probability 1, and is $\int_{E_a} f(\xi)\,{}_a\pi(d\xi)$ if $x_1(\omega) \in E_a$ with probability 1.

According to Theorem 6.1, as trivially generalized after its proof, the
present theorem is true if the initial distribution $\pi$ is a stationary absolute
probability distribution. The choice $\pi = \frac{1}{l}\sum_a {}_a\pi$, where $l$ is the number
of ergodic classes, then shows that if $x_1(\omega) = \xi \in \bigcup_a E_a$, but if $\xi$ does
not belong to some set $A_0$ with $\sum_a {}_a\pi(A_0) = 0$, the limit (6.2) will exist
(and have the stated value if $\xi \in E_a$) with probability 1 when the $\pi$
distribution is concentrated at the single point $\xi$. It follows that the
limit (6.2) will exist and have the stated value with probability 1 when
the $\pi$ distribution is unrestricted except that $\pi(A_0) = 0$. To finish the
proof it will thus suffice to show that $A_0$ is the null set. According to §5,




for any $\xi_1$, if $x_1(\omega) = \xi_1$, $x_n(\omega)$ will finally be in an ergodic set, and in
fact will remain in whichever one it enters first. More than that, $x_n(\omega)$
will at some time be in

$$G = \bigcup_a E_a - A_0 \bigcup_a E_a$$

because ${}_a\pi(A_0) = 0$ for all $a$ implies by (5.11) that

$$\lim_{n\to\infty} p^{(n)}(\xi_1, G) = 1.$$

Consider the sample sequences for which $x_N(\omega)$ is the first $x_n(\omega)$ in $G$.
Then for these sample sequences the averages in (6.2) have the same
limits (if any) as the averages

$$\frac{1}{n}\sum_{m=N}^{n} f(x_m), \qquad n = N, N+1, \cdots.$$

But these averages behave like the averages

$$\frac{1}{n}\sum_{m=1}^{n-N+1} f(x_m), \qquad n = N, N+1, \cdots,$$

with the initial $x_1$ distribution confined to $G$. Hence these averages
approach

$$\int_{E_a} f(\xi)\,{}_a\pi(d\xi)$$

when $x_N(\omega) \in E_a$ with probability 1. The limit in (6.2) thus exists with
probability 1 for the $\pi(A)$ distribution concentrated at $x_1(\omega) = \xi_1$. In
particular, if $\xi_1 \in E_a$ the limit is as stated in the theorem. Thus $A_0$ is
empty, as was to be proved.


7. The central limit theorem 

In this section we shall discuss the applicability of the central limit 
theorem to Markov processes. We shall assume ($D_0$):

(a) Condition (D) is satisfied;

(b) there is only a single ergodic set and this set contains no cyclically
moving subsets.

We have proved in §5 that under Hypothesis ($D_0$) there are positive
constants $\gamma$ and $\rho$, $\rho < 1$, and a (unique) stationary absolute probability
distribution $p(\cdot)$, such that

$$|p^{(n)}(\xi, E) - p(E)| \le \gamma\rho^n, \qquad n = 1, 2, \cdots. \tag{7.1}$$


The distribution p(-) taken as initial distribution together with the 
stochastic transition function determines a stationary Markov process, 




and in this section the expectations and probabilities involved will always 
be based on this process unless the contrary is stated explicitly. Thus 


$$E\{f(x_n)\} = \int_X f(\xi)\,p(d\xi),$$

$$E\{f(x_m)\,g(x_{m+n})\} = \int_X f(\xi)\,p(d\xi)\int_X g(\eta)\,p^{(n)}(\xi, d\eta),$$

and so on.
Define $S(\xi, A)$ by

$$S(\xi, A) = \sum_{n=1}^{\infty} [p^{(n)}(\xi, A) - p(A)]$$

and $V_n(\xi, A)$ by

$$V_n(\xi, A) = \max_{B \subset A}\,[p^{(n)}(\xi, B) - p(B)] - \min_{B \subset A}\,[p^{(n)}(\xi, B) - p(B)].$$

Then $V_n(\xi, A)$ is the variation of the $n$th summand of $S(\xi, \cdot)$ on $A$, for
fixed $\xi$; the first term on the right is the positive variation, the second
is the negative variation. There is a set $A_\xi$ on all subsets of which

$$p^{(n)}(\xi, A) - p(A) \ge 0$$

and on all subsets of whose complement

$$p^{(n)}(\xi, A) - p(A) \le 0,$$

so that the defining equation of $V_n(\xi, A)$ becomes

$$V_n(\xi, A) = [p^{(n)}(\xi, A A_\xi) - p(A A_\xi)] - [p^{(n)}(\xi, A(X - A_\xi)) - p(A(X - A_\xi))].$$

The variation $V_n(\xi, A)$ is completely additive in $A$ for fixed $\xi$, and

$$V_n(\xi, A) \le 2\gamma\rho^n,$$

$$V_n(\xi, A) \le p^{(n)}(\xi, A) + p(A).$$
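
For a chain with finitely many states the variation reduces to a sum of absolute differences, $V_n(\xi, A) = \sum_{j \in A} |p^{(n)}_{\xi j} - p_j|$, since the maximum is attained on the states where the difference is positive and the minimum on the rest. The following sketch computes it for a two-state chain; the matrix and the function names are illustrative assumptions, chosen so that the geometric rate can be written down exactly.

```python
def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def n_step(P, n):
    """n-step transition matrix P^n (identity for n = 0)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def variation(P, p, n, xi, A):
    """V_n(xi, A): total variation of p^(n)(xi, .) - p(.) over the states in A."""
    row = n_step(P, n)[xi]
    return sum(abs(row[j] - p[j]) for j in A)


if __name__ == "__main__":
    # Hypothetical two-state chain with stationary probabilities (0.25, 0.75);
    # here P^n = Pi + 0.6^n (I - Pi), so V_n(0, X) = 1.5 * 0.6^n.
    P = [[0.7, 0.3],
         [0.1, 0.9]]
    p = [0.25, 0.75]
    for n in (1, 5, 10):
        print(n, variation(P, p, n, 0, [0, 1]), 1.5 * 0.6 ** n)
```

In this example the bound $V_n(\xi, A) \le 2\gamma\rho^n$ holds with $\rho = 0.6$ and $\gamma = 0.75$, and additivity in $A$ is visible term by term.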


We shall prove several lemmas needed for the central limit theorem. 
In each lemma it is supposed that Hypothesis ($D_0$) is satisfied, and the
lemmas all express in various ways the fact that then $x_m$ and $x_{m+n}$ are
nearly independent if $n$ is large.

LEMMA 7.1 Under Hypothesis ($D_0$), if $f$ is a random variable on
$x_1, \cdots, x_m$ sample space, and $g$ one on $x_{m+n}, x_{m+n+1}, \cdots$ sample space,
and if for some $r > 1$, $s > 1$ with $1/r + 1/s = 1$,

$$E\{|f|^r\} < \infty, \qquad E\{|g|^s\} < \infty, \tag{7.2}$$

then

$$|E\{fg\} - E\{f\}E\{g\}| \le 2\gamma^{1/r}\rho^{n/r}\,E^{1/r}\{|f|^r\}\,E^{1/s}\{|g|^s\}. \tag{7.3}$$




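In particular, if $f$ and $g$ are functions of $\xi \in X$, measurable with respect to
$\mathscr{F}_X$, and if, with $r$, $s$ as above,

$$E\{|f(x_1)|^r\} < \infty, \qquad E\{|g(x_1)|^s\} < \infty,$$

then

$$\sum_{n=1}^{\infty} [E\{f(x_1)g(x_{n+1})\} - E\{f(x_1)\}E\{g(x_1)\}] = \int_X f(\xi)\,p(d\xi)\int_X g(\eta)\,S(\xi, d\eta). \tag{7.4}$$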


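In fact, applying Hölder's inequality repeatedly,

$$\begin{aligned}
|E\{fg\} - E\{f\}E\{g\}|
&= |E\{f[E\{g \mid x_m\} - E\{g\}]\}| \\
&= \left|E\Big\{f\int_X E\{g \mid x_{m+n}(\omega) = \eta\}\,[p^{(n)}(x_m, d\eta) - p(d\eta)]\Big\}\right| \\
&\le E\Big\{|f|\int_X E\{|g| \mid x_{m+n}(\omega) = \eta\}\,V_n(x_m, d\eta)\Big\} \\
&\le E^{1/r}\{|f|^r\}\,E^{1/s}\Big\{\Big[\int_X E\{|g| \mid x_{m+n}(\omega) = \eta\}\,V_n(x_m, d\eta)\Big]^s\Big\} \\
&\le (2\gamma\rho^n)^{1/r}\,E^{1/r}\{|f|^r\}\,E^{1/s}\Big\{\int_X E\{|g|^s \mid x_{m+n}(\omega) = \eta\}\,V_n(x_m, d\eta)\Big\} \\
&\le (2\gamma\rho^n)^{1/r}\,E^{1/r}\{|f|^r\}\,E^{1/s}\Big\{\int_X E\{|g|^s \mid x_{m+n}(\omega) = \eta\}\,[p^{(n)}(x_m, d\eta) + p(d\eta)]\Big\} \\
&= (2\gamma\rho^n)^{1/r}\,E^{1/r}\{|f|^r\}\,[2E\{|g|^s\}]^{1/s} \\
&= 2\gamma^{1/r}\rho^{n/r}\,E^{1/r}\{|f|^r\}\,E^{1/s}\{|g|^s\},
\end{aligned}$$

which yields (7.3). In particular, if $f$ and $g$ are as described in the second
part of the theorem, we can apply the preceding continued inequality to
$f(x_1)$, $g(x_{n+1})$, with $m = 1$, obtaining from the fourth and the last lines

$$\int_X |f(\xi)|\,p(d\xi)\int_X |g(\eta)|\,V_n(\xi, d\eta) \le 2\gamma^{1/r}\rho^{n/r}\,E^{1/r}\{|f(x_1)|^r\}\,E^{1/s}\{|g(x_1)|^s\}.$$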
bi x 
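Then the integral on the right in (7.4) is absolutely convergent, because

$$|S(\xi, A)| \le \sum_{n=1}^{\infty} V_n(\xi, A),$$

so that the integral is dominated by the convergent series

$$\sum_{n=1}^{\infty} \int_X |f(\xi)|\,p(d\xi)\int_X |g(\eta)|\,V_n(\xi, d\eta) \le 2\gamma^{1/r}\,E^{1/r}\{|f(x_1)|^r\}\,E^{1/s}\{|g(x_1)|^s\}\sum_{n=1}^{\infty} \rho^{n/r}.$$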




The series on the left in (7.4) is absolutely convergent because of (7.3).
The equality

$$\sum_{k=1}^{n} [E\{f(x_1)g(x_{k+1})\} - E\{f(x_1)\}E\{g(x_1)\}] = \sum_{k=1}^{n} \int_X f(\xi)\,p(d\xi)\int_X g(\eta)\,[p^{(k)}(\xi, d\eta) - p(d\eta)]$$

then implies (7.4) when $n \to \infty$.

According to Lemma 7.1 the correlation between $f(x_1)$ and functions
of $x_k$ for large $k$ goes to 0 exponentially when $k \to \infty$. If $f$ is bounded,
the reason underlying this fact can be expressed in a particularly simple 


form as follows. 
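LEMMA 7.2 Under Condition ($D_0$), if $f$ is a bounded random variable,
$|f| \le M$, on $x_{k+1}, x_{k+2}, \cdots$ sample space, then

$$|E\{f \mid x_1\} - E\{f\}| \le 2\gamma M\rho^k. \tag{7.5}$$

If $f(\omega) = 1$ when $x_{k+1}(\omega) \in A$ and $f(\omega) = 0$ otherwise, this inequality
reduces to

$$|p^{(k)}(x_1, A) - p(A)| \le 2\gamma\rho^k,$$

and this is true even without the factor 2, by (7.1). The general case is
easily treated as follows:

$$\begin{aligned}
|E\{f \mid x_1(\omega) = \xi\} - E\{f\}|
&= \left|\int_X E\{f \mid x_{k+1}(\omega) = \eta\}\,[p^{(k)}(\xi, d\eta) - p(d\eta)]\right| \\
&\le \int_X E\{|f| \mid x_{k+1}(\omega) = \eta\}\,V_k(\xi, d\eta) \\
&\le M V_k(\xi, X) \le 2\gamma M\rho^k.
\end{aligned}$$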


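LEMMA 7.3 Under Condition ($D_0$), let $f(\cdot)$ be a function of $\xi \in X$,
measurable with respect to $\mathscr{F}_X$, with

$$E\{f(x_1)\} = 0, \qquad E\{|f(x_1)|^2\} = \sigma^2 < \infty,$$

$$E\{|f(x_1)|^2\} + 2\Re\int_X f(\xi)\,p(d\xi)\int_X \overline{f(\eta)}\,S(\xi, d\eta) = E\{|f(x_1)|^2\} + 2\Re\sum_{n=2}^{\infty} E\{f(x_1)\overline{f(x_n)}\} = \sigma_1^2.$$

Then

$$\lim_{n\to\infty} E\Big\{\frac{1}{n}\Big|\sum_{j=1}^{n} f(x_j)\Big|^2\Big\} = \sigma_1^2$$

and in fact

$$\lim_{n\to\infty}\Big[E\Big\{\Big|\sum_{j=1}^{n} f(x_j)\Big|^2\Big\} - n\sigma_1^2\Big] = -2\Re\sum_{k=1}^{\infty} k\,E\{f(x_1)\overline{f(x_{k+1})}\}. \tag{7.6}$$

If $\sigma \ne 0$, the limit on the right and $\sigma_1^2$ do not vanish simultaneously.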
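
The quantities in Lemma 7.3 can be evaluated exactly for a small chain. The sketch below is an illustration under assumptions: a hypothetical two-state chain, a real $f$ with stationary mean 0, and function names of our own choosing. It computes the covariances $c_k = E\{f(x_1)f(x_{k+1})\}$, forms $\sigma_1^2 = c_0 + 2\sum_{k \ge 1} c_k$ by truncating the series, and compares $E\{|\sum_1^n f(x_j)|^2\} - n\sigma_1^2$ with the right side of (7.6).

```python
def autocovariances(P, p, f, kmax):
    """c_k = E{f(x_1) f(x_{k+1})}, k = 0..kmax, under the stationary distribution p."""
    size = len(P)
    g = f[:]  # g holds P^k f, starting with k = 0
    out = []
    for _ in range(kmax + 1):
        out.append(sum(p[i] * f[i] * g[i] for i in range(size)))
        g = [sum(P[i][j] * g[j] for j in range(size)) for i in range(size)]
    return out


def second_moment_of_sum(c, n):
    """E{|f(x_1) + ... + f(x_n)|^2} for a stationary real sequence with covariances c."""
    return n * c[0] + 2 * sum((n - k) * c[k] for k in range(1, n))


if __name__ == "__main__":
    # Hypothetical chain: stationary probabilities (0.25, 0.75), second eigenvalue 0.6.
    P = [[0.7, 0.3],
         [0.1, 0.9]]
    p = [0.25, 0.75]
    f = [3.0, -1.0]  # E{f(x_1)} = 0.25 * 3 - 0.75 = 0
    c = autocovariances(P, p, f, 400)
    sigma1_sq = c[0] + 2 * sum(c[1:])                         # truncated series
    rhs = -2 * sum(k * c[k] for k in range(1, len(c)))        # right side of (7.6)
    print(sigma1_sq, second_moment_of_sum(c, 200) - 200 * sigma1_sq, rhs)
```

For this chain $c_k = 3 \cdot 0.6^k$, so $\sigma^2 = 3$, $\sigma_1^2 = 12$, and the limit in (7.6) is $-22.5$; the dependence multiplies the variance of the sum by four relative to the independent case.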




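In particular, if the $x_j$'s are mutually independent, the $n$th bracket on
the left in (7.6) vanishes and $\sigma_1^2 = \sigma^2$. The lemma is thus trivial in this
case. In general this bracket has the value

$$\begin{aligned}
&2\Re\sum_{k=1}^{n-1} (n - k)\,E\{f(x_1)\overline{f(x_{k+1})}\} - 2n\Re\sum_{k=1}^{\infty} E\{f(x_1)\overline{f(x_{k+1})}\} \\
&\qquad = -2\Re\sum_{k=1}^{n-1} k\,E\{f(x_1)\overline{f(x_{k+1})}\} - 2n\Re\sum_{k=n}^{\infty} E\{f(x_1)\overline{f(x_{k+1})}\}.
\end{aligned}$$

When $n \to \infty$ this becomes the desired equation (7.6), in view of the
inequalities of Lemma 7.1, with $g = f$, $r = s = 2$. If $\sigma_1 = 0$ and if the
right side of (7.6) also vanishes, (7.6) implies

$$\operatorname*{l.i.m.}_{n\to\infty} \sum_{j=1}^{n} f(x_j) = 0.$$

But then l.i.m. $f(x_n) = 0$, and this is impossible, since $E\{|f(x_n)|^2\} = \sigma^2$,
unless $\sigma = 0$, that is, unless $f(x_1) = 0$ with probability 1.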

LEMMA 7.4 Under Condition ($D_0$), let $f(\cdot)$ be a function of $\xi \in X$,
measurable with respect to $\mathscr{F}_X$, with

$$E\{f(x_1)\} = 0, \qquad E\{|f(x_1)|^l\} < \infty$$

for some $l \ge 2$. Then there is a constant $a$, for which

$$E\Big\{\Big|\sum_{1}^{n} f(x_j)\Big|^l\Big\} \le a n^{l/2}, \qquad n = 1, 2, \cdots. \tag{7.7}$$

Lemma 7.3 shows that (7.7) is true for $l = 2$. It will therefore be
sufficient to assume that (7.7) is true if $l$ is an integer $m \ge 2$ and prove
that it is then true if $l = m + \delta$, where $0 < \delta \le 1$. We thus assume in
the following that $E\{|f(x_1)|^{m+\delta}\} < \infty$ and that the lemma is true if $l = m$.

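Let $k$ be a positive integer, to be determined more precisely below, and
define $s_n$, $t_n$, $\hat{s}_n$, $c_n$ by

$$s_n = \sum_{1}^{n} f(x_j), \qquad t_n = \sum_{n+1}^{n+k} f(x_j), \qquad \hat{s}_n = \sum_{n+k+1}^{2n+k} f(x_j),$$

$$c_n = E\Big\{\Big|\sum_{1}^{n} f(x_j)\Big|^{m+\delta}\Big\}.$$

Then we are to prove

$$c_n \le a n^{(m+\delta)/2} \tag{7.7'}$$

for proper choice of $a$. In order to prove this we first prove that, if $\varepsilon_1 > 0$,

$$E\{|s_n + \hat{s}_n|^{m+\delta}\} \le (2 + \varepsilon_1)c_n + a_1 n^{(m+\delta)/2} \tag{7.8}$$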




for proper choice of $a_1$ and $k$. In fact, remembering that $s_n$ and $\hat{s}_n$ have
the same distribution,


(7.9) Efls, + Sr} < Eflin + Sal™ Sal? + [Sal 
<2, +25 (7) sr È (7) ist 


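Now by Lemma 7.1, with

$$f = |s_n|^u, \qquad g = |\hat{s}_n|^v, \qquad r = \frac{m+\delta}{u}, \qquad s = \frac{m+\delta}{v},$$

$$E\{|s_n|^u|\hat{s}_n|^v\} \le 2\gamma^{1/r}\rho^{(k+1)/r}c_n + E\{|s_n|^u\}E\{|\hat{s}_n|^v\}. \tag{7.10}$$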
We substitute this in (7.9), giving $u$ and $v$ the appropriate values. In each
case $0 < \delta \le u \le m$, $0 < \delta \le v \le m$. Hence, using Hölder's inequality,
the last term in (7.10) is at most

$$E^{u/m}\{|s_n|^m\}\,E^{v/m}\{|\hat{s}_n|^m\} = E^{(m+\delta)/m}\{|s_n|^m\}.$$

Now we are supposing that (7.7') is true (with some $a$) if $\delta = 0$. Hence
the preceding product is at most const. $n^{(m+\delta)/2}$. Combining these results,
we find that 


m+ 


aE A a g n Ea, 
for some constants $a_1$, $b$ not involving $k$. To prove (7.8) we need only
increase $k$, if necessary, to make the second term in the parentheses $< \varepsilon_1$.
Next we prove that, if $\varepsilon > 0$, there is a constant $a_2$ and a value of $k$ for
which

$$c_{2n} \le (2 + \varepsilon)c_n + a_2 n^{(m+\delta)/2}, \qquad n \ge 1. \tag{7.11}$$


In fact, applying Minkowski’s inequality and (7.8), we find that 
Intk 
Cr = E{|s, + S$, + ta AAR Sæ 


SEmi s, +8, ae = ue s5 EMM ON) Ka, havin |g 


+1 j= 


2 jie + ele, + qn” +m) 4 Ake,tim+|m+2 


Be [a + eDi + eden + ayn Papin +n 
if n is sufficiently large. Then 
Con SL + a" * (2 + een + anit”? 




if nis sufficiently large. If £; is so small that (1 + e)™*(2 + 4) < 2+, 
there must be an a, for which (7.11) is true. According to (7.11) 
rnei pai 


CS Q+ +a 2 +2492" E eH 


<Q+.6" TERE E 
(2 + e)'c, + a, TEEF r 
l— -ari 


22 


if $\varepsilon$ is so small that $2 + \varepsilon < 2^{(m+\delta)/2}$. Then, if $\varepsilon$ is chosen in this way,

$$c_{2^r} \le a_3\,2^{r(m+\delta)/2}, \qquad r \ge 0, \tag{7.12}$$

where
Q-(m+oy/2 
hehe du beeen, 
1 ays 
2 2 


Finally if $n$ is any positive integer it can be written in the form

$$n = 2^r + \nu_1 2^{r-1} + \cdots + \nu_r,$$

where

$$2^r \le n < 2^{r+1}$$

and each $\nu_j$ is either 0 or 1. Then $s_n$ can be written as the sum of $r + 1$
groups of sums containing $2^r$, $\nu_1 2^{r-1}$, $\cdots$ terms and using Minkowski's
inequality, (7.12), and the fact that the $f(x_n)$ process is stationary,


en < [Euo fe [rey 4 EUH] pat 
ee oe Him th [toy 2 


<4, [ziz + Qr-WI2 TEE AD ate 1+? 


as a mt 
= a3 | ———_ < anim 
s| Qt 1 


for some constant a, as was to be proved. 

We are now in a position to discuss the central limit theorem. Let $\pi$
be a probability measure of $\mathscr{F}_X$ sets, and consider the Markov process
obtained using $\pi$ as an initial probability distribution together with a
stochastic transition function satisfying Condition (D). The relevant




expectations and probabilities will be distinguished from those obtained
when $\pi = p$ by use of the subscript $\pi$ so that

$$E_\pi\{f(x_1)\} = \int_X f(\xi)\,\pi(d\xi), \qquad E\{f(x_1)\} = \int_X f(\xi)\,p(d\xi),$$

and so on. We wish to show that, for a wide class of functions $f$,
$\frac{1}{\sqrt{n}}\sum_{m=1}^{n} f(x_m)$ is nearly normally distributed for large $n$, no matter what
initial distribution $\pi$ is used. The most important cases are $\pi = p$ and $\pi$
a distribution concentrated at a single point $\xi_1$ so that probabilities become
conditional probabilities under the condition $x_1(\omega) = \xi_1$.
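THEOREM 7.5 Suppose that Condition ($D_0$) is satisfied, and that $f(\cdot)$ is
a real function of $\xi \in X$, measurable with respect to $\mathscr{F}_X$, with

$$E\{|f(x_1)|^{2+\delta}\} = \int_X |f(\xi)|^{2+\delta}\,p(d\xi) < \infty$$

for some $\delta > 0$. Then

$$\lim_{n\to\infty} E\Big\{\frac{1}{n}\Big[\sum_{m=1}^{n} \big(f(x_m) - E\{f(x_m)\}\big)\Big]^2\Big\} = \sigma_1^2 \tag{7.13}$$

exists; if $\sigma_1^2 > 0$, and if $\pi$ is any initial distribution (of $x_1$),

$$\lim_{n\to\infty} P_\pi\Big\{\frac{1}{\sigma_1\sqrt{n}}\sum_{m=1}^{n} \big[f(x_m) - E\{f(x_m)\}\big] \le \lambda\Big\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\lambda} e^{-u^2/2}\,du \tag{7.14}$$

uniformly in $\lambda$.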


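The limit equation (7.13) has already been proved (Lemma 7.3) and is
only restated here for completeness. It is no restriction to assume that
$E\{f(x_1)\} = 0$, and we shall do so.

Let $\alpha$, $\beta$ be positive integers and let $\mu(\alpha + \beta)$ be the largest multiple of
$\alpha + \beta$ which is $\le n$. Define $y_m$, $y_m'$ by

$$y_m = \sum_{j=(m-1)(\alpha+\beta)+1}^{(m-1)(\alpha+\beta)+\alpha} f(x_j), \qquad m = 1, \cdots, \mu,$$

$$y_m' = \sum_{j=(m-1)(\alpha+\beta)+\alpha+1}^{m(\alpha+\beta)} f(x_j), \qquad m = 1, \cdots, \mu,$$

$$y_{\mu+1}' = \sum_{j=\mu(\alpha+\beta)+1}^{n} f(x_j).$$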


The theorem is proved by choosing $\beta$ so large that the $y_m$'s are nearly
mutually independent and therefore nearly subject to the central limit 




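theorem for mutually independent random variables, but so small that the
contribution of the $y_m'$'s in (7.14) can be neglected when $n \to \infty$. We
first consider the case $\pi = p$. In the course of the discussion $\alpha$, $\beta$, and $\mu$
will vary with $n$, becoming infinite when $n \to \infty$ in such a way that (7.15)
and (7.18) will be satisfied.

We prove first that, if

$$\lim_{n\to\infty} \frac{\mu\beta}{\alpha} = 0, \tag{7.15}$$

then

$$\operatorname*{p\,lim}_{n\to\infty} \frac{1}{\sqrt{n}}\sum_{m=1}^{\mu+1} y_m' = 0. \tag{7.16}$$

In fact, using (7.6) and Minkowski's inequality,

$$\begin{aligned}
E^{1/2}\Big\{\Big|\frac{1}{\sqrt{n}}\sum_{1}^{\mu+1} y_m'\Big|^2\Big\}
&\le \frac{1}{\sqrt{n}}\sum_{1}^{\mu+1} E^{1/2}\{|y_m'|^2\} \\
&\le \frac{\mu}{\sqrt{n}}\,[\beta\sigma_1^2 + \text{const.}]^{1/2} + \frac{1}{\sqrt{n}}\,[(\alpha + \beta)\sigma_1^2 + \text{const.}]^{1/2} \\
&\le \frac{\mu}{\sqrt{\mu\alpha}}\,[\beta\sigma_1^2 + \text{const.}]^{1/2} + \frac{1}{\sqrt{\mu\alpha}}\,[(\alpha + \beta)\sigma_1^2 + \text{const.}]^{1/2}.
\end{aligned}$$

The right side of this inequality goes to 0 when $n \to \infty$ if (7.15) is true,
and this implies (7.16). Then if (7.15) is true the contribution of the
$y_m'$'s in (7.14) can be neglected when $n \to \infty$.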
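In the second place we prove that, if $\Phi_\alpha(t)$ is defined by

$$\Phi_\alpha(t) = E\Big\{e^{it\sum_{1}^{\alpha} f(x_j)}\Big\},$$

then for each $\mu$

$$E\Big\{e^{it\sum_{1}^{\mu} y_j}\Big\} = \Phi_\alpha(t)^\mu + \delta_n, \qquad |\delta_n| \le 2\gamma\mu\rho^\beta. \tag{7.17}$$

In fact, the left side of (7.17) is given by

$$\begin{aligned}
&E\Big\{e^{it\sum_{1}^{\mu-1} y_j}\,E\{e^{ity_\mu} \mid x_1, \cdots, x_{(\mu-2)(\alpha+\beta)+\alpha}\}\Big\} \\
&\qquad = E\Big\{e^{it\sum_{1}^{\mu-1} y_j}\Big\}\,\Phi_\alpha(t) + \eta_1, \qquad |\eta_1| \le 2\gamma\rho^\beta,
\end{aligned}$$

where we have used Lemma 7.2. Repeating this argument, we find that
the left side of this equation can be put in the form

$$\Phi_\alpha(t)^\mu + \eta_1 + \cdots + \eta_{\mu-1}, \qquad |\eta_j| \le 2\gamma\rho^\beta,$$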

which implies (7.17). Thus, if we can choose $\alpha$, $\beta$, $\mu$ to satisfy (7.15)
and also

$$\lim_{n\to\infty} \mu\rho^\beta = 0, \tag{7.18}$$

the distribution to be proved asymptotically normal, with mean 0 and
variance $\sigma_1^2$, is that with characteristic function $\Phi_\alpha(t/\sqrt{n})^\mu$. This is the
distribution of $\sum_{1}^{\mu} z_m$, where the $z_m$'s are mutually independent and each
has the distribution of $y_1/\sqrt{n}$. It is clearly sufficient to prove the stated
asymptotic normality for the random variable $\sum_{1}^{\mu} z_m$.
1 


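Now $E\{z_m\} = 0$, and, using Lemma 7.3,

$$\mu E\{z_m^2\} = \frac{\mu}{n}\,E\Big\{\Big|\sum_{1}^{\alpha} f(x_j)\Big|^2\Big\} \to \sigma_1^2 \qquad (n \to \infty),$$

if (7.15) is true (so that $\alpha\mu/n \to 1$), and, by Lemma 7.4,

$$E\{|z_m|^{2+\delta}\} = \frac{1}{n^{1+\delta/2}}\,E\Big\{\Big|\sum_{1}^{\alpha} f(x_j)\Big|^{2+\delta}\Big\} \le \frac{a\,\alpha^{1+\delta/2}}{n^{1+\delta/2}}.$$

Then, for large $n$,

$$\frac{\sum_{1}^{\mu} E\{|z_m|^{2+\delta}\}}{\Big[\sum_{1}^{\mu} E\{z_m^2\}\Big]^{(2+\delta)/2}} \le \frac{2\mu a\,(\alpha/n)^{1+\delta/2}}{\sigma_1^{2+\delta}}.$$

This quotient goes to 0 like $(\alpha/n)^{\delta/2}$ when $n \to \infty$, and this implies the
stated asymptotic normality (for $\pi = p$) by II, Theorem 4.4. We
observe that (7.15) and (7.18) are consistent; for example, we can take $\beta$
as the largest integer whose fourth power is $\le n$, $\alpha = \beta^3$, and then
$\mu = n/(\alpha + \beta) = \beta$ approximately, so that (7.15) and (7.18) are certainly
satisfied.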

To prove asymptotic normality for an arbitrary initial distribution $\pi$
we observe that for any $m$, using Lemma 7.2,


ae È Kap, +, = fz) 
|E qer m+n ie Efe"? m+ t 
ri 


Jl 


it n 


n Re : 
L| nao fae ae) = gnc 


<2yp”. 




Then, if $n > m$,


i Efe) Efe) 
ma) 7) Te: x, 

e Clee Ce) 
A Esap ais Ea) 
LEAP — 1) + Efe” — 1} + 2yp™. 
The third term on the right can be made arbitrarily small by choosing $m$
large. For such a choice of $m$, the first two terms on the right can then
be made arbitrarily small for $t$ in any finite interval by choosing $n$ large.
Thus the characteristic functions of $n^{-1/2}\sum_{1}^{n} f(x_j)$, $n \ge 1$, are asymptotically
the same, as $n \to \infty$, for the two definitions of probability considered
here, one with initial distribution $p$, the other with initial distribution $\pi$.
It follows that the distribution of $n^{-1/2}\sum_{1}^{n} f(x_j)$ is asymptotically normal,
with mean 0 and variance $\sigma_1^2$. This finishes the proof of the theorem.

Finally we prove a generalization of this theorem which is useful in
statistical applications. Suppose that in the theorem the functions depend
on more than one $x_j$. Let $f(\cdot, \cdots, \cdot)$ be a function of $\xi_1, \cdots, \xi_r$, $\xi_j \in X$,
measurable with respect to $\mathscr{F}_X \times \cdots \times \mathscr{F}_X$. We consider the random
variables

$$\frac{1}{\sqrt{n}}\sum_{m=1}^{n} \big[f(x_m, \cdots, x_{m+r-1}) - E\{f(x_m, \cdots, x_{m+r-1})\}\big], \qquad n = 1, 2, \cdots,$$

and wish to derive asymptotic normality for $n \to \infty$. Theorem 7.5
covers the case $r = 1$, and we shall now show how to reduce the general
case to this one. To do this, replace $X$ by the space $\tilde{X}$ of points $\tilde{\xi}$:
$(\xi_1, \cdots, \xi_r)$, $\xi_j \in X$, replace $\mathscr{F}_X$ by the product field $\tilde{\mathscr{F}}_X = \mathscr{F}_X \times \cdots \times \mathscr{F}_X$,
and replace the space of points $\omega$: $(\xi_1, \xi_2, \cdots)$, $\xi_j \in X$ by the
space of points $\tilde{\omega}$: $(\tilde{\xi}_1, \tilde{\xi}_2, \cdots)$, $\tilde{\xi}_j \in \tilde{X}$. Let $\tilde{x}_j$ be the new $j$th coordinate
function, so that $\tilde{x}_j(\tilde{\omega}) = \tilde{\xi}_j$. Define $\tilde{x}_1, \tilde{x}_2, \cdots$ $\tilde{\omega}$ probabilities to be the
same as $\hat{x}_1, \hat{x}_2, \cdots$ $\omega$ probabilities, where $\hat{x}_j$ is the $r$-tuple $(x_j, \cdots, x_{j+r-1})$.
Then the $\tilde{x}_n$ process is a Markov process satisfying Condition ($D_0$) if the
$x_n$ process is such a process. The function $f$ of $(\xi_1, \cdots, \xi_r)$ defines a
function $\tilde{f}$ of $\tilde{\xi}$, and the $\omega$ random variables

$$\{f(x_m, \cdots, x_{m+r-1}),\ m \ge 1\}$$

have the same joint distributions as the $\tilde{\omega}$ random variables $\{\tilde{f}(\tilde{x}_m),\ m \ge 1\}$.
Thus we have reduced the problem involving $f$ to the corresponding one
involving $\tilde{f}$ for which $r = 1$. Applying Theorem 7.5 to the latter problem,
we obtain the following.




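THEOREM 7.5' Suppose that Condition ($D_0$) is satisfied, and that
$f(\cdot, \cdots, \cdot)$ is a real function of $\xi_1, \cdots, \xi_r$, measurable with respect to
$\mathscr{F}_X \times \cdots \times \mathscr{F}_X$, with

$$E\{|f(x_1, \cdots, x_r)|^{2+\delta}\} < \infty$$

for some $\delta > 0$. Then, if $f_m = f(x_m, \cdots, x_{m+r-1})$,

$$\lim_{n\to\infty} E\Big\{\frac{1}{n}\Big[\sum_{m=1}^{n} \big(f_m - E\{f_m\}\big)\Big]^2\Big\} = \sigma_1^2 \tag{7.13'}$$

exists; if $\sigma_1^2 > 0$ and if $\pi$ is any initial distribution (of $x_1$),

$$\lim_{n\to\infty} P_\pi\Big\{\frac{1}{\sigma_1\sqrt{n}}\sum_{m=1}^{n} \big[f_m - E\{f_m\}\big] \le \lambda\Big\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\lambda} e^{-u^2/2}\,du \tag{7.14'}$$

uniformly in $\lambda$.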

As an example of the application of Theorems 7.5 and 7.5', let
$x_1, x_2, \cdots$ be mutually independent random variables with a common
distribution function. Then Condition ($D_0$) is certainly satisfied.
Theorem 7.5 is applicable, but reduces to III, Theorem 4.3 (which has a weaker
hypothesis, $\delta = 0$). Theorem 7.5' yields something new. The limit
variance $\sigma_1^2$ in (7.13') is, according to Lemma 7.3,

$$\sigma_1^2 = E\{(f_1 - E\{f_1\})^2\} + 2\sum_{m=2}^{r} E\{(f_1 - E\{f_1\})(f_m - E\{f_m\})\}.$$

[This is also easily evaluated from (7.13').] As a particular case suppose
that the $x_j$'s are real-valued random variables, with

$$E\{x_j\} = 0, \qquad E\{x_j^2\} = a_2, \qquad E\{x_j^4\} = a_4,$$

and that $r = 2$, $f_1 = (x_2 - x_1)^2$. Then

$$\sigma_1^2 = 4a_4,$$

and we find that

$$\frac{1}{\sqrt{n}}\sum_{m=1}^{n} \big[(x_{m+1} - x_m)^2 - 2a_2\big]$$

is asymptotically normal with mean 0 and variance $4a_4$, if $E\{|x_1|^{4+\delta}\} < \infty$
for some $\delta > 0$.
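
The value $\sigma_1^2 = 4a_4$ can be verified by exact enumeration for a discrete distribution. The sketch below is an illustration only: the four-point distribution and the function names are assumptions, and exact rational arithmetic is used so that the identity can be checked without rounding. With $r = 2$ only $f_1$ and $f_2$ are dependent, so $\sigma_1^2$ is the variance of $f_1$ plus twice the covariance of $f_1$ and $f_2$.

```python
from fractions import Fraction
from itertools import product


def expect(h, values, probs, arity):
    """E{h(x_1, ..., x_arity)} for independent x's with the given distribution."""
    total = Fraction(0)
    for combo in product(range(len(values)), repeat=arity):
        weight = Fraction(1)
        for i in combo:
            weight *= probs[i]
        total += weight * h(*[values[i] for i in combo])
    return total


def limit_variance(values, probs):
    """sigma_1^2 = Var f_1 + 2 Cov(f_1, f_2) for f_m = (x_{m+1} - x_m)^2."""
    mean_f = expect(lambda x1, x2: (x2 - x1) ** 2, values, probs, 2)
    var_f = expect(lambda x1, x2: ((x2 - x1) ** 2 - mean_f) ** 2, values, probs, 2)
    cov = expect(
        lambda x1, x2, x3: ((x2 - x1) ** 2 - mean_f) * ((x3 - x2) ** 2 - mean_f),
        values, probs, 3,
    )
    return var_f + 2 * cov


if __name__ == "__main__":
    # Hypothetical distribution with mean 0.
    values = [-2, -1, 1, 2]
    probs = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]
    a4 = sum(p * v ** 4 for v, p in zip(values, probs))
    print(limit_variance(values, probs), 4 * a4)
```

The two printed values agree; the terms in $a_2^2$ contributed by the variance and by the covariance cancel, which is why only the fourth moment survives.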

The Condition ($D_0$) was used to simplify the formal work above. If
Condition (D) is satisfied and if there is only one ergodic set but if there 
may be cyclically moving subsets, Theorems 7.5 and 7.5’ are easily 
extended to show asymptotic normality of the same sums. If there is 
more than one ergodic class, however, the limit distributions may be 
weighted averages of normal distributions. 




8. Markov processes in the wide sense 


Let $\{x_t, t \in T\}$ be a family of real- or complex-valued random variables
and suppose that

$$E\{|x_t|^2\} < \infty$$

for all $t$.
Define $R(s, t)$ by

$$R(s, t) = \begin{cases} \dfrac{E\{x_t\bar{x}_s\}}{E\{|x_s|^2\}}, & E\{|x_s|^2\} > 0, \\ 0, & E\{|x_s|^2\} = 0. \end{cases}$$

Then $x_t - R(s, t)x_s$ is orthogonal to $x_s$, that is,

$$\hat{E}\{x_t \mid x_s\} = R(s, t)x_s.$$

According to II §6 the $x_t$ process is a Markov process in the wide sense
if, whenever $t_1 < \cdots < t_n$,

$$\hat{E}\{x_{t_n} \mid x_{t_1}, \cdots, x_{t_{n-1}}\} = \hat{E}\{x_{t_n} \mid x_{t_{n-1}}\} \tag{8.1}$$

with probability 1.
THEOREM 8.1 An $x_t$ process is a Markov process in the wide sense if
and only if $E\{|x_t|^2\} < \infty$ and $R$ satisfies the functional equation

$$R(s, u) = R(s, t)R(t, u), \qquad s < t < u. \tag{8.2}$$

To prove the theorem define $y(t, u)$ as the difference

$$x_u - R(t, u)x_t = y(t, u).$$

Then $y(t, u)$ is orthogonal to $x_t$. If the $x_t$ process is a Markov process in
the wide sense,

$$\hat{E}\{x_u \mid x_s, x_t\} = \hat{E}\{x_u \mid x_t\} = R(t, u)x_t, \tag{8.3}$$

so that $y(t, u)$ is orthogonal to $x_s$ also,

$$E\{x_u\bar{x}_s\} - R(t, u)E\{x_t\bar{x}_s\} = 0. \tag{8.4}$$

This is equivalent to (8.2). Conversely, if (8.2) is true, (8.4) is also true,
and this equation states that $y(t, u)$ is orthogonal to every $x_s$ for $s < t$;
that is,

$$R(t, u)x_t = \hat{E}\{x_u \mid x_t\} = \hat{E}\{x_u \mid x_{s_1}, \cdots, x_{s_n}, x_t\}$$

with probability 1 if $s_1 < \cdots < s_n < t < u$, and this is equivalent to the
condition (8.1) that the process be a Markov process in the wide sense.
In particular, if the $x_t$ process is real and Gaussian and if $E\{x_t\} = 0$,
the condition of the theorem is necessary and sufficient that the $x_t$ process




be a Markov process in the strict sense (II §6) in accordance with the 
general relationship between wide sense and strict sense concepts. 

Theorem 8.1 becomes particularly simple if the $x_t$ process is stationary
in the wide sense (see II §8). In this case $R(s, t)$ depends only on the
difference $t - s$, so that we can write $R(s, t) = R(t - s)$. The condition
(8.2) then becomes

$$R(t_1 + t_2) = R(t_1)R(t_2), \qquad t_1, t_2 \ge 0.$$

If the $x_t$'s are a sequence $x_1, x_2, \cdots$, this means that

$$R(n) = \frac{E\{x_{m+n}\bar{x}_m\}}{E\{|x_m|^2\}} = a^n R(0), \qquad n \ge 0,$$

for some constant $a$. By Schwarz's inequality

$$|R(n)| \le R(0),$$

so that $|a| \le 1$. In the continuous parameter case

$$R(t) = \frac{E\{x_{s+t}\bar{x}_s\}}{E\{|x_s|^2\}} = e^{-ct}R(0), \qquad t \ge 0,$$

where $c$ has non-negative real part, if $R(\cdot)$ is known to be continuous.
Finally we remark that, if the $x_t$'s are a sequence $x_1, x_2, \cdots$, the
condition (8.1) is easily shown to be equivalent to the condition

$$\hat{E}\{x_n \mid x_1, \cdots, x_{n-1}\} = \hat{E}\{x_n \mid x_{n-1}\} \tag{8.1'}$$

(with probability 1) for all $n > 1$.
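
The functional equation gives a quick test for the wide sense Markov property of a stationary sequence. The sketch below is illustrative and rests on assumptions not taken from the text: the normalized correlation function $R(n) = a^n$ is used as the geometric case, and a first-order moving average correlation (with $R(1) = \theta/(1 + \theta^2)$ and $R(n) = 0$ for $n \ge 2$) is brought in as a contrasting example; all function names are hypothetical.

```python
def satisfies_functional_equation(R, max_lag, tol=1e-12):
    """Check R(m + n) = R(m) R(n) for all m, n >= 0 with m + n <= max_lag."""
    return all(
        abs(R(m + n) - R(m) * R(n)) <= tol
        for m in range(max_lag + 1)
        for n in range(max_lag + 1 - m)
    )


def geometric_correlation(a):
    """R(n) = a^n, the form required of a stationary wide sense Markov sequence."""
    return lambda n: a ** n


def moving_average_correlation(theta):
    """Correlation of x_n = e_n + theta * e_{n-1} with uncorrelated e's (an assumption)."""
    def R(n):
        if n == 0:
            return 1.0
        if n == 1:
            return theta / (1.0 + theta ** 2)
        return 0.0
    return R


if __name__ == "__main__":
    print(satisfies_functional_equation(geometric_correlation(0.5), 6))
    print(satisfies_functional_equation(moving_average_correlation(0.5), 6))
```

The geometric case passes; the moving average case fails already at $R(2) = 0 \ne R(1)^2$, so by Theorem 8.1 that sequence is not a Markov process in the wide sense.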


CHAPTER VI


Markov Processes— 
Continuous Parameter 


1. Markov chains with finitely many states 


This section leans heavily on V §1 and §2, which treat the discrete 
parameter case. Let $\{x_t, 0 \le t < \infty\}$ be a Markov process with a finite
number of states labeled $1, \cdots, N$. That is, we suppose that

$$\sum_{j=1}^{N} P\{x_t(\omega) = j\} = 1.$$

(In most applications the random variables of the process actually do not
assume other values than $1, \cdots, N$, but this restriction makes no
difference in the theory.) If $P\{x_s(\omega) = i\} > 0$, define $p_{ij}(s, t)$ by

$$p_{ij}(s, t) = P\{x_t(\omega) = j \mid x_s(\omega) = i\}$$

and let $P(s, t)$ be the matrix $[p_{ij}(s, t)]$. Then, if $P\{x_s(\omega) = i\} > 0$,

$$p_{ij}(s, t) \ge 0, \qquad \sum_j p_{ij}(s, t) = 1, \tag{1.1}$$

and

$$p_{ik}(s, u) = \sum_j p_{ij}(s, t)\,p_{jk}(t, u), \qquad 0 \le s < t < u, \tag{1.2}$$

[where in (1.2) we only sum over values of $j$ for which $p_{jk}(t, u)$ is defined.]
In matrix notation the preceding equation, a special case of the
Chapman-Kolmogorov equation, becomes

$$P(s, u) = P(s, t)P(t, u), \qquad 0 \le s < t < u.$$

It is convenient to define $P(t, t)$ as the identity matrix, and we shall use
this definition throughout. The Chapman-Kolmogorov equation system
(1.2) is then valid for $0 \le s \le t \le u$. The stochastic process is said to
have stationary transition probabilities if for each pair $i$, $j$, whenever




$P\{x_s(\omega) = i\} > 0$, the transition probability $p_{ij}(s, t)$ depends only on $t - s$.
In this case we write $p_{ij}(t - s)$ for $p_{ij}(s, t)$, and (1.1) and (1.2) specialize to

$$p_{ij}(t) \ge 0, \qquad \sum_j p_{ij}(t) = 1, \qquad t \ge 0, \tag{1.1'}$$

$$p_{ik}(s + t) = \sum_j p_{ij}(s)\,p_{jk}(t), \qquad s, t \ge 0. \tag{1.2'}$$

In matrix notation the preceding equation is

$$P(s + t) = P(s)P(t).$$

The matrix $P(0)$ is by definition the identity matrix.

A matrix function $P(\cdot, \cdot)$ satisfying (1.1) and (1.2) will be called a
Markov transition matrix function. (To avoid needless complexity we
suppose that every element of the matrix function is defined for all values
of the arguments.) A matrix function $P(\cdot)$ satisfying (1.1') and (1.2') will
be called a stationary Markov transition matrix function. Given a Markov
transition matrix function, there is a corresponding Markov process
$\{x_t, 0 \le t < \infty\}$ obtained by choosing any initial probabilities $P\{x_0(\omega) = i\}$,
$i = 1, \cdots, N$, and, for every finite $t$ set $0 = t_0 < t_1 < \cdots < t_n$, defining

$$P\{x_{t_0}(\omega) = a_0, \cdots, x_{t_n}(\omega) = a_n\}
  = P\{x_0(\omega) = a_0\}\, p_{a_0 a_1}(0, t_1) \cdots p_{a_{n-1} a_n}(t_{n-1}, t_n)$$

(see II §6). These basic probabilities determine an $x_t$ process which is a
Markov chain with the given initial and transition probabilities. The
basic $\omega$ space can be taken as the space of all functions of $t$, $0 \le t < \infty$,
which assume only the integral values $1, \cdots, N$, or as the space of all real
functions of $t$, $0 \le t < \infty$, without this restriction.
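The product formula above can be tried out numerically. The following is a minimal sketch, not part of the original text: it assumes a two-state stationary chain whose transition function has the standard closed form for exit rates `a` and `b` (these rates, the function names, and the 0-based state labels are illustrative choices). It checks the Chapman-Kolmogorov equation $P(s + t) = P(s)P(t)$ and that the finite-dimensional probabilities sum to 1.

```python
import math
from itertools import product

def two_state_p(t, a=1.0, b=2.0):
    """Closed-form P(t) for a two-state chain: state 0 is left at rate a, state 1 at rate b."""
    s = a + b
    e = math.exp(-s * t)
    return [[(b + a * e) / s, (a - a * e) / s],
            [(b - b * e) / s, (a + b * e) / s]]

def mat_mul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]

def path_probability(initial, times, states):
    """P{x_{t_0}=a_0, ..., x_{t_n}=a_n} as initial probability times transition factors."""
    prob = initial[states[0]]
    for t_prev, t_next, a_prev, a_next in zip(times, times[1:], states, states[1:]):
        prob *= two_state_p(t_next - t_prev)[a_prev][a_next]
    return prob

initial = [0.3, 0.7]                 # hypothetical initial distribution
times = [0.0, 0.5, 1.25, 3.0]        # hypothetical finite t set with t_0 = 0

# Summing the basic probabilities over all state sequences must give 1.
total = sum(path_probability(initial, times, path)
            for path in product(range(2), repeat=len(times)))

# Chapman-Kolmogorov in matrix form: P(s + t) = P(s) P(t).
lhs = two_state_p(0.5 + 0.75)
rhs = mat_mul(two_state_p(0.5), two_state_p(0.75))
ck_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
```

Both `total - 1` and `ck_gap` should be zero up to rounding error, which is the consistency needed for the basic probabilities to define a process.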

THEOREM 1.1 If $[p_{ij}(\cdot)]$ is a stationary Markov transition matrix
function, then $\lim_{t \to \infty} p_{ij}(t)$ exists for all $i, j$ and the limit is approached
exponentially fast.

Note that this result is simpler than that in the discrete parameter case,
because Cesaro limits are not required here. Note also that we have
made no continuity or even measurability assumptions whatever. To
prove the theorem we first fix $t$ and prove that

$$\lim_{n \to \infty} p_{ij}(nt) = w_{ij}(t) \tag{1.3}$$

exists for all $i, j$. We shall then show that the matrix $W(t)$: $[w_{ij}(t)]$ does
not depend on $t$, and finally we shall consider the asymptotic behavior of
$p_{ij}(t)$ when $t \to \infty$ in an arbitrary manner. We have seen in V §2 that
either $\lim_{n \to \infty} p_{ij}(n t_0)$ exists for all $i, j$ [when the stochastic matrix $P(t_0)$ does


not determine cyclically moving classes of states] or that for some choice
of $\nu$ $\lim_{n \to \infty} p_{ij}(n \nu t_0)$ exists for all $i, j$. Here $\nu$ is any integer divisible by
certain cycle lengths denoted by $d_1, d_2, \cdots$ in V §2, and $\sum_j d_j \le N$, so
that we can certainly set $\nu = N!$. Since $t_0$ is arbitrary, we can derive (1.3)
by setting $t = N!\, t_0$. Obviously

$$W(s + t) = W(s)W(t),$$
$$W(t)^n = W(nt) = W(t), \qquad n = 1, 2, \cdots.$$

Now suppose that $s < t$. Then

$$W(s + t) = W(2s)W(t - s) = W(s)W(t - s) = W(t),$$

so that $W(u) = W(t)$ for $t \le u \le 2t$. Hence $W(t)$ is independent of $t$,
$W(t) = W$: $[w_{ij}]$, and (1.3) becomes

$$\lim_{n \to \infty} p_{ij}(nt) = w_{ij} \tag{1.3'}$$

for all $i, j, t$. Here $W = W^2$ and the matrix $W$ has the characteristics of


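a limit matrix described in V §2: there are (ergodic) classes $E_1, E_2, \cdots$
of states and a class $F$ of transient states such that:

$$p_{ij}(t) = 0 \qquad (i \in E_a,\ j \notin E_a),$$

$$w_{ij} = \begin{cases}
{}_a\pi_j, & i, j \in E_a, \\
\rho_i(E_a)\, {}_a\pi_j, & i \in F,\ j \in E_a, \\
0, & i \in E_a,\ j \notin E_a, \\
0, & j \in F,
\end{cases}$$

where

$${}_a\pi_j > 0, \quad j \in E_a, \qquad \sum_{j \in E_a} {}_a\pi_j = 1,$$

$$\rho_i(E_a) \ge 0, \qquad \sum_a \rho_i(E_a) = 1.$$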


Just as in the discrete parameter case of V §2, $\sum_{j \in F} p_{ij}(t)$ is monotone
non-increasing in $t$, for each $i$. When $t \to \infty$ with $t$ integral-valued, we
have the discrete parameter case, and know that this sum approaches 0
exponentially fast. Hence the same is true when $t \to \infty$ with no restriction
on the value of $t$,

$$\lim_{t \to \infty} p_{ij}(t) = 0, \qquad j \in F.$$

To investigate $p_{ij}(t)$ for $i, j \in E_a$, define $m_j^{(t)}$, $M_j^{(t)}$ by

$$m_j^{(t)} = \min_{i \in E_a} p_{ij}(t), \qquad M_j^{(t)} = \max_{i \in E_a} p_{ij}(t), \qquad j \in E_a.$$




Then, just as in the discrete parameter case, see V §2, Case (b), $m_j^{(t)}$ and
$M_j^{(t)}$ are respectively monotone non-decreasing and non-increasing.
When $t \to \infty$ with $t$ integral-valued, we are in the discrete parameter case,
and know that these functions approach ${}_a\pi_j$ exponentially fast; hence the
same is true when $t \to \infty$ with no restriction on the values of $t$, and we
have proved that

$$\lim_{t \to \infty} p_{ij}(t) = {}_a\pi_j$$

and that the convergence is exponentially fast. Finally

$$p_{ik}(s + t) = \sum_{j \notin F} p_{ij}(s)\, p_{jk}(t) + \sum_{j \in F} p_{ij}(s)\, p_{jk}(t), \qquad i \in F,\ k \notin F,$$

so that when $t \to \infty$ the first sum approaches a limit exponentially fast
for fixed $s$, whereas the second is at most $\sum_{j \in F} p_{ij}(s)$, which is
(exponentially) small for large $s$. Therefore $\lim_{u \to \infty} p_{ik}(u)$ exists when $i \in F$, $k \notin F$,
and the limit is approached exponentially fast. This finishes the proof
of the theorem. As in the discrete parameter case, the rows of the limit
matrix are sets of stationary absolute probabilities and every set of
stationary absolute probabilities is a linear combination of these rows.
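The structure of the limit matrix can be illustrated on a small example. The sketch below is not from the original text: the four-state rate matrix `Q`, the hand-computed limit `W`, and all function names are hypothetical, and the transition function is taken to be the matrix exponential of `t*Q` (an assumption made only for this illustration). State 0 is transient, states 1 and 2 form one ergodic class, and state 3 is an ergodic class by itself.

```python
import math

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]

def mat_exp(q, t, terms=20):
    """Exponential series for e^{tQ}, with scaling and squaring to keep the series short."""
    n = len(q)
    norm = max(sum(abs(v) for v in row) for row in q) * abs(t)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    a = [[v * t / 2 ** squarings for v in row] for row in q]
    result, term = identity(n), identity(n)
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical rates: 0 -> 1 at rate 1, 0 -> 3 at rate 2, 1 -> 2 at rate 1, 2 -> 1 at rate 2.
Q = [[-3.0, 1.0, 0.0, 2.0],
     [0.0, -1.0, 1.0, 0.0],
     [0.0, 2.0, -2.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]

# Limit matrix worked out by hand for this example: the class {1, 2} has stationary
# probabilities (2/3, 1/3) and is reached from state 0 with probability 1/3.
W = [[0.0, 2 / 9, 1 / 9, 2 / 3],
     [0.0, 2 / 3, 1 / 3, 0.0],
     [0.0, 2 / 3, 1 / 3, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

def error(t):
    """Largest deviation of p_ij(t) from the limit w_ij."""
    p = mat_exp(Q, t)
    return max(abs(p[i][j] - W[i][j]) for i in range(4) for j in range(4))

errors = {t: error(t) for t in (0.5, 1.0, 2.0, 5.0)}
```

In this example the deviations in `errors` shrink roughly like $e^{-3t}$, and the row of `W` for the transient state is a mixture of the rows belonging to the two ergodic classes.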

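We proceed to a discussion of the sample functions of a Markov chain
with a given stationary Markov matrix transition function. We shall
always make the assumption that $P(\cdot)$ is continuous when $t = 0$,

$$\lim_{t \to 0} p_{ij}(t) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{1.4}$$

[The first line implies the second in view of (1.1').] This condition that
the $p_{ij}(\cdot)$'s are continuous when $t = 0$ implies that they are continuous
for all $t$, because

$$\lim_{\epsilon \downarrow 0} p_{ij}(s + \epsilon) = \lim_{\epsilon \downarrow 0} \sum_k p_{ik}(s)\, p_{kj}(\epsilon) = p_{ij}(s),$$

$$\lim_{\epsilon \downarrow 0} [p_{ij}(s) - p_{ij}(s - \epsilon)]
 = \lim_{\epsilon \downarrow 0} \Big[ \sum_k p_{ik}(s - \epsilon)\, p_{kj}(\epsilon) - p_{ij}(s - \epsilon) \Big]$$
$$ = \lim_{\epsilon \downarrow 0} \Big[ \sum_{k \ne j} p_{ik}(s - \epsilon)\, p_{kj}(\epsilon) + p_{ij}(s - \epsilon)[p_{jj}(\epsilon) - 1] \Big] = 0.$$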




Since

$$P\{x_s(\omega) \ne x_t(\omega)\} = \sum_i P\{x_\tau(\omega) = i\}\,[1 - p_{ii}(|t - s|)]
 \le \max_i\, [1 - p_{ii}(|t - s|)], \qquad \tau = \min(s, t),$$

the condition (1.4) is equivalent to the condition that the quantity on the
left goes to 0 when $t \to s$, for every initial distribution of probabilities,
and all $s$. That is, (1.4) is true if and only if

$$\operatorname*{p\,lim}_{t \to s} x_t = x_s \tag{1.4'}$$

for every initial distribution of probabilities. The stochastic limit (1.4')
will be strengthened to a probability-one limit in Theorem 1.2.

If (1.4) is true $p_{ii}(t) > 0$ for small $t$. Hence the inequality

$$p_{ii}(s + t) \ge p_{ii}(s)\, p_{ii}(t)$$

implies that $p_{ii}(t) > 0$ for all $t \ge 0$. Moreover, if $j \ne i$, $p_{ij}(t)$ either
vanishes identically or never vanishes except when $t = 0$. In fact, fixing
$i$ and $j$ with $i \ne j$, suppose that $p_{ij}(s) > 0$ for some $s$ which is fixed in the
following argument. Then we prove that $p_{ij}(t) > 0$ if $t > 0$. Since

$$p_{ij}(t) \ge p_{ij}(u)\, p_{jj}(t - u), \qquad 0 < u < t,$$

and since the second factor on the right is always positive, it is sufficient
to prove that $p_{ij}(u) > 0$ for some $u < t$. Now, for every positive integer
$m$, $p_{ij}(m \cdot s/m) > 0$, that is, in the language of V §2, $j$ is a consequent of $i$
of order $m$ for the stochastic matrix $P(s/m)$. It was proved in V §2 that
$j$ is then a consequent of $i$ of some order $n \le N$, $p_{ij}(ns/m) > 0$. If
$m > Ns/t$, the desired value of $u$ can be taken as $ns/m$.

The fact that, if (1.4) is true, $p_{ij}(t)$ never vanishes for $t > 0$ unless it
vanishes identically shows that the stochastic matrix $P(t)$ ($t$ fixed) cannot
determine cyclically moving classes of states. This fact has already been
proved above in connection with Theorem 1.1 without the assumption
that (1.4) is true.

Suppose now that $p_{ij}(\cdot)$ has a derivative $p'_{ij}(\cdot)$ for all $t \ge 0$ and set

$$q_i = \lim_{t \to 0} \frac{1 - p_{ii}(t)}{t} = -p'_{ii}(0), \qquad
  q_{ij} = \lim_{t \to 0} \frac{p_{ij}(t)}{t} = p'_{ij}(0), \quad i \ne j. \tag{1.5}$$




Let $Q$ be the matrix $[q_{ij}]$, where we set $q_{ii} = -q_i$. Using (1.1'), we
find that

$$q_{ij} \ge 0 \quad (i \ne j), \qquad \sum_{j \ne i} q_{ij} = q_i. \tag{1.6}$$

Differentiating the Chapman-Kolmogorov equation (1.2') with respect to
each variable and setting it equal to 0, we obtain two systems of differential
equations,

$$p'_{ik}(t) = -q_i\, p_{ik}(t) + \sum_{j \ne i} q_{ij}\, p_{jk}(t), \qquad i, k = 1, \cdots, N, \tag{1.7}$$

$$p'_{ik}(t) = -p_{ik}(t)\, q_k + \sum_{j \ne k} p_{ij}(t)\, q_{jk}, \qquad i, k = 1, \cdots, N. \tag{1.7'}$$

The first system is called the backward system, the second the forward
system, for reasons to be given below. In each case the initial conditions
are given by

$$p_{ij}(0) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{1.8}$$

The $q_i$'s and $q_{ij}$'s determine the $p_{ij}(t)$'s uniquely. We shall investigate
this question from two points of view. Firstly, we shall show that the
systems (1.7) and (1.7') with initial conditions (1.8) have a unique solution
satisfying (1.1') and (1.2') if $Q$ satisfies (1.6). Secondly we shall construct
a stochastic process by means of a given $Q$ satisfying (1.6) which will be
a Markov chain and whose transition probabilities will satisfy (1.7),
(1.7'), (1.8).

We consider first the system (1.7). This system of differential equations
is studied most easily in its matrix form

$$P'(t) = QP(t)$$

with the initial condition

$$P(0) = I,$$

where $I$ is the identity. Then we can write a solution [to both (1.7)
and (1.7')]

$$P(t) = e^{tQ},$$

where the exponential of a matrix is defined as the (element by element)
sum of the exponential series. It is clear that, for any matrix $Q$, $P(t)$ as
so defined furnishes a solution of (1.7) which satisfies the initial conditions
(1.8) and the Chapman-Kolmogorov equations (1.2'). If $Q$ satisfies
(1.6), $P(\cdot)$ is a probability solution in the sense that (1.1') is also true. In
fact, firstly, if $Q$ satisfies (1.6) we sum over $k$ in (1.7') to obtain

$$\sum_k p'_{ik}(t) = 0$$




so that $\sum_k p_{ik}(t) = \text{const.}$, and this constant is 1 since it is 1 when $t = 0$.
Secondly, the following argument shows that the $p_{ij}(t)$'s are $\ge 0$. Suppose
that no $q_{ij}$ vanishes. Then $p_{ij}(t) > 0$ for sufficiently small $t$ since, if
$j = i$, $p_{ii}(0) = 1$ and, if $j \ne i$, $p_{ij}(0) = 0$, $p'_{ij}(0) = q_{ij} > 0$. Unless the
$p_{ij}(t)$'s are always $> 0$ there is a finite positive $\delta$ such that $\delta$ is the largest
$h$ for which

$$p_{ij}(t) > 0, \qquad i, j = 1, \cdots, N, \qquad 0 < t < h.$$

But the equation $P(\delta + h) = P(\delta)P(h)$ shows that the elements of $P(t)$ are
$> 0$ for $t < 2\delta$, in contradiction to the definition of $\delta$. Thus the $p_{ij}(t)$'s
are all $> 0$ if no $q_{ij}$ vanishes. In the general case let $Q_n$ be the matrix
$Q$ with every $q_i = -q_{ii}$ replaced by $q_i + (N - 1)/n$ and every $q_{ij}$ ($j \ne i$)
replaced by $q_{ij} + 1/n$. Then $Q_n$ satisfies (1.6) and has no vanishing
elements, so the elements of $P_n(t)$, defined by

$$P_n(t) = e^{tQ_n},$$

are $> 0$. When $n \to \infty$ an elementary calculation shows that $P_n(t)$
becomes $P(t)$, so that the elements of $P(t)$ are all $\ge 0$. We have not yet
shown that the solution we have obtained is unique. However, if

$$P'(t) = QP(t),$$

it follows that

$$P^{(n)}(t) = Q^n P(t).$$

An application of Taylor's theorem with remainder then shows that $P(t)$
must be given by

$$P(t) = e^{tQ}P(0),$$

as was to be proved.

The system (1.7') is, in matrix form, $P'(t) = P(t)Q$, and can be treated
in the same way. It has solution

$$P(t) = P(0)e^{tQ}.$$

Thus the systems (1.7) and (1.7') have the same solution under the initial
condition $P(0) = I$ which we have accepted.
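The claims about $e^{tQ}$ can be checked numerically on a small case. The following is a minimal sketch under stated assumptions: the 3-state matrix `Q` is a hypothetical example satisfying (1.6), the helper names are illustrative, and the series is evaluated with a scaling-and-squaring step that is a numerical convenience rather than anything in the text. It compares a central-difference derivative of $P(t)$ with $QP(t)$ and $P(t)Q$, and inspects row sums and signs.

```python
import math

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]

def mat_exp(q, t, terms=20):
    """e^{tQ} as the element-by-element sum of the exponential series (scaled, then squared)."""
    n = len(q)
    norm = max(sum(abs(v) for v in row) for row in q) * abs(t)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    a = [[v * t / 2 ** squarings for v in row] for row in q]
    result, term = identity(n), identity(n)
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def max_abs_diff(x, y):
    return max(abs(a - b) for rx, ry in zip(x, y) for a, b in zip(rx, ry))

# Hypothetical Q: off-diagonal entries >= 0 (some zero), each row sums to 0.
Q = [[-2.0, 1.0, 1.0],
     [0.5, -0.5, 0.0],
     [0.0, 3.0, -3.0]]

t, h = 0.8, 1e-4
P = mat_exp(Q, t)
# Central difference approximation to P'(t).
derivative = [[(u - v) / (2 * h) for u, v in zip(ru, rv)]
              for ru, rv in zip(mat_exp(Q, t + h), mat_exp(Q, t - h))]

backward_gap = max_abs_diff(derivative, mat_mul(Q, P))   # system (1.7):  P' = QP
forward_gap = max_abs_diff(derivative, mat_mul(P, Q))    # system (1.7'): P' = PQ
row_sums = [sum(row) for row in P]
smallest_entry = min(min(row) for row in P)
semigroup_gap = max_abs_diff(mat_exp(Q, 1.1), mat_mul(mat_exp(Q, 0.4), mat_exp(Q, 0.7)))
```

Both gaps should be of the order of the finite-difference error, the row sums should equal 1, and no entry of `P` should be negative, in line with the argument above.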

The following two examples are typical of the simple examples that
arise in practical applications. In such examples the $q_i$'s and $q_{ij}$'s are
commonly given by theoretical considerations or deduced from experimental
data, using the fact that, neglecting second-order terms, $q_{ij}\, dt$ is
the probability of a transition from $i$ to $j$ in time $dt$, and $1 - q_i\, dt$ the
probability of no transition.

Example 1 Suppose that only transitions from $i$ to $i + 1$ are possible
(and from $N$ to 1), and in fact that, for $j \ne i$,

$$q_{ij} = \begin{cases} 0, & j \not\equiv i + 1 \pmod N, \\ q, & j \equiv i + 1 \pmod N. \end{cases}$$




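Then (1.7) becomes (if we fix $k$ and identify $N + 1$ with 1)

$$p'_{ik}(t) = -q\, p_{ik}(t) + q\, p_{i+1\,k}(t), \qquad i = 1, \cdots, N.$$

If $\hat p_{ik}(t) = e^{qt} p_{ik}(t)$, we find that

$$\hat p'_{ik}(t) = q\, \hat p_{i+1\,k}(t), \qquad \hat p_{ik}(0) = \delta_{ik}.$$

It follows that

$$\hat p^{(N)}_{ik}(t) = q^N \hat p_{ik}(t),$$

so that $\hat p_{ik}(t)$ must have the form

$$\hat p_{ik}(t) = \sum_{\nu=1}^{N} c_\nu^{(i)} e^{qt\alpha_\nu}, \qquad \alpha_\nu = e^{2\pi\nu\sqrt{-1}/N},$$

and, substituting in the difference-differential equation for $\hat p_{ik}(\cdot)$, we find
that we can write $c_\nu^{(i)}$ in the form $c_\nu^{(i)} = \alpha_\nu^{i} c_\nu$. Combining this with
the initial conditions, we find that

$$\hat p_{ik}(t) = \frac{1}{N} \sum_{\nu=1}^{N} \alpha_\nu^{i-k} e^{qt\alpha_\nu},$$

so that

$$p_{ik}(t) = \frac{1}{N} \sum_{\nu=1}^{N} \alpha_\nu^{i-k} e^{qt(\alpha_\nu - 1)}.$$

If $\nu = N$, $\alpha_\nu - 1 = 0$; otherwise $\alpha_\nu - 1$ has a negative real part. Hence

$$\lim_{t \to \infty} p_{ik}(t) = \frac{1}{N},$$

as was intuitively clear in the first place.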
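The closed form of Example 1 can be checked against an independent calculation. The sketch below is illustrative and not from the original text: it evaluates the root-of-unity sum $\frac{1}{N}\sum_\nu \alpha_\nu^{i-k} e^{qt(\alpha_\nu - 1)}$ and compares it with the observation (an added remark) that, since every state is left at the same rate $q$, the number of jumps up to time $t$ is Poisson with mean $qt$, and the chain is in state $k$ exactly when that number is congruent to $k - i$ modulo $N$. Function names and parameter values are hypothetical.

```python
import cmath
import math

def cyclic_p(i, k, t, q, n_states):
    """p_ik(t) for the cyclic chain, from the sum over the N-th roots of unity."""
    total = 0.0
    for nu in range(1, n_states + 1):
        alpha = cmath.exp(2j * math.pi * nu / n_states)
        total += alpha ** (i - k) * cmath.exp(q * t * (alpha - 1))
    return (total / n_states).real

def cyclic_p_poisson(i, k, t, q, n_states, terms=200):
    """Independent check: Poisson(q t) number of jumps, congruent to k - i mod N."""
    d = (k - i) % n_states
    total = 0.0
    term = math.exp(-q * t)          # probability of m = 0 jumps
    for m in range(terms):
        if m % n_states == d:
            total += term
        term *= q * t / (m + 1)      # probability of m + 1 jumps
    return total

N, q = 5, 1.5                        # hypothetical parameters
gap = max(abs(cyclic_p(i, k, 0.7, q, N) - cyclic_p_poisson(i, k, 0.7, q, N))
          for i in range(1, N + 1) for k in range(1, N + 1))
late = cyclic_p(1, 3, 60.0, q, N)    # should be close to 1/N
```

The two evaluations agree to rounding error, and for large $t$ the value is indistinguishable from $1/N$.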

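Example 2 In Example 1 there is only one ergodic class of states, and
there are no transient states. To exhibit other possibilities we modify the
above example by making the system remain in state $N$ after reaching it:

$$q_{ij} = \begin{cases} 0, & j \ne i + 1, \\ q, & j = i + 1, \end{cases} \qquad i = 1, \cdots, N - 1 \quad (j \ne i),$$
$$q_N = 0.$$

In this case (1.7) becomes

$$p'_{ik}(t) = -q\, p_{ik}(t) + q\, p_{i+1\,k}(t), \qquad i < N,$$
$$p'_{Nk}(t) = 0,$$

with initial conditions $p_{ik}(0) = \delta_{ik}$. Then

$$p_{Nk}(t) = 0, \quad k \ne N, \qquad p_{NN}(t) = 1,$$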




and it is easily verified that the solution is

$$p_{ik}(t) = \begin{cases}
0, & k < i, \\[2pt]
e^{-qt}\, \dfrac{(qt)^{k-i}}{(k-i)!}, & i \le k < N, \\[6pt]
e^{-qt} \Big[ e^{qt} - 1 - qt - \cdots - \dfrac{(qt)^{N-i-1}}{(N-i-1)!} \Big], & k = N.
\end{cases}$$

In this example there is only one ergodic class of states, the class
containing the $N$th state only. The other states are all transient.
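The solution of Example 2 can be verified directly. The sketch below is illustrative only (function name and parameter values are hypothetical): it codes the three cases of the formula, with states labeled $1, \cdots, N$ as in the text, and checks that each row sums to 1 and that the backward equation $p'_{ik} = -q\,p_{ik} + q\,p_{i+1\,k}$ holds for $i < N$, using a central difference for the derivative.

```python
import math

def absorbing_p(i, k, t, q, n_states):
    """p_ik(t) for the chain that moves i -> i+1 at rate q and stays in state N."""
    if k < i:
        return 0.0
    if k < n_states:
        return math.exp(-q * t) * (q * t) ** (k - i) / math.factorial(k - i)
    # k = N: one minus the probability of fewer than N - i jumps
    partial = sum((q * t) ** m / math.factorial(m) for m in range(n_states - i))
    return 1.0 - math.exp(-q * t) * partial

N, q, t, h = 4, 1.3, 0.9, 1e-5       # hypothetical parameters

row_sums = [sum(absorbing_p(i, k, t, q, N) for k in range(1, N + 1))
            for i in range(1, N + 1)]

backward_gap = 0.0
for i in range(1, N):                # backward equation for i < N
    for k in range(1, N + 1):
        derivative = (absorbing_p(i, k, t + h, q, N)
                      - absorbing_p(i, k, t - h, q, N)) / (2 * h)
        rhs = -q * absorbing_p(i, k, t, q, N) + q * absorbing_p(i + 1, k, t, q, N)
        backward_gap = max(backward_gap, abs(derivative - rhs))
```

As $t$ grows, `absorbing_p(i, N, t, q, N)` tends to 1 for every $i$, which reflects the statement that the $N$th state is the only ergodic state.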

The preceding example illustrates a general procedure. Given any
Markov chain as studied in this section, one can choose a state, say the
$N$th state, and modify the process so that if the system ever reaches that
state it remains there. In terms of the $q_i$'s and $q_{ij}$'s this means simply
that $q_N$ and $q_{N1}, \cdots, q_{N\,N-1}$ are replaced by 0. In terms of the old
process and sample functions the new transition probability $\tilde p_{ij}(t)$ ($j \ne N$) obtained
in this way is the probability that, if $x_0(\omega) = i$, then $x_t(\omega) = j$ and
$x_s(\omega) \ne N$ for any $s \le t$; $\tilde p_{iN}(t)$ is the probability that, if $x_0(\omega) = i$, then
$x_s(\omega) = N$ for some $s \le t$. This procedure can be considered as making
the state $N$ an absorbing barrier. Several states can, of course, be made
absorbing barriers simultaneously.

We shall now investigate in detail stationary Markov transition matrix
functions, under the continuity restriction (1.4). It will be shown that
the transition probabilities necessarily have derivatives which satisfy the
systems of differential equations (1.7) and (1.7'). Particular attention
will be paid to the relation between the properties of the transition
probabilities and the properties of the sample functions of the separable
Markov processes determined by the matrix transition functions. We
shall always adopt a unique definition of a conditional probability of the
form $P\{\Lambda \mid x_{t_0}(\omega) = i\}$, where $\Lambda$ is defined by restrictions on sample
functions for $t \ge t_0$. We take the obvious definition in terms of the
given transition probabilities, and accept this whether or not $P\{x_{t_0}(\omega) = i\}$
happens to be positive.

In the rest of this chapter it will frequently be convenient to denote a
random variable depending on the parameter $t$ by $x(t)$, rather than $x_t$,
and to denote its value at $\omega$ by $x(t, \omega)$, rather than $x_t(\omega)$.

THEOREM 1.2 If $[p_{ij}(\cdot)]$ is a stationary Markov transition matrix
function satisfying (1.4), the limit

$$\lim_{t \to 0} \frac{1 - p_{ii}(t)}{t} = q_i < \infty \tag{1.9}$$

exists for all $i$.




If $\{x(t), 0 \le t < \infty\}$ is a separable process determined by $[p_{ij}(\cdot)]$ together
with an initial probability distribution, then

$$P\{x(\tau, \omega) = i,\ t_0 \le \tau \le t_0 + \alpha \mid x(t_0, \omega) = i\} = e^{-q_i \alpha}, \tag{1.10}$$

and, if $x(t_0, \omega) = i$, then $x(t, \omega) = i$ in some neighborhood of $t_0$ (whose
size depends on $\omega$), with probability 1.

We observe that the last clause does not state that almost all sample
functions are continuous, but only that at most the sample functions of
a collection of sample functions of probability 0 have a discontinuity at
any particular value of $t$.

The existence of the limit in (1.9) is, of course, a purely analytic fact
which has of itself nothing to do with the theory of probability; it is
implied by the conditions (1.1'), (1.2'), and the continuity condition (1.4).
However, it is convenient and somewhat instructive to use probability
theory to prove the existence of this limit, and we shall do so. Suppose
then that the $x(t)$ process is as described in the second half of the theorem.
Then the conditional probability on the left in (1.10) is a function $f(\cdot)$ of $\alpha$ which
satisfies the functional equation

$$f(\alpha + \beta) = f(\alpha) f(\beta), \qquad \alpha, \beta > 0.$$

Moreover, $0 \le f(\alpha) \le 1$. Hence $f(\cdot)$ must be monotone non-increasing,
and the only solution of the functional equation under this restriction has
the form

$$f(\alpha) = e^{-\text{const.}\,\alpha},$$


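where the constant is non-negative and may be $+\infty$, in which case we
interpret the exponential function as 0. We, therefore, have (1.10) with
some constant $q_i$ [not yet identified as the limit in (1.9)], where
$0 \le q_i \le +\infty$. The following argument excludes the case $q_i = +\infty$.
Suppose that $0 = \tau_0 < \cdots < \tau_n = \alpha$ and define ${}_\nu p_{ii}$ by

$${}_\nu p_{ii} = \begin{cases} 1, & \nu = 0, \\
P\{x(\tau_j, \omega) = i,\ j = 0, \cdots, \nu \mid x(0, \omega) = i\}, & \nu \ge 1. \end{cases}$$

Then, if $\epsilon > 0$ and if $\alpha$ is so small that $p_{jj}(t) > 1 - \epsilon$ for all $j$ when $t \le \alpha$,

$$1 - \epsilon \le p_{ii}(\alpha) = {}_n p_{ii} + \sum_{\nu=0}^{n-2} \sum_{j \ne i} {}_\nu p_{ii}\, p_{ij}(\tau_{\nu+1} - \tau_\nu)\, p_{ji}(\alpha - \tau_{\nu+1})$$
$$\le {}_n p_{ii} + \epsilon \sum_{\nu=0}^{n-2} \sum_{j \ne i} {}_\nu p_{ii}\, p_{ij}(\tau_{\nu+1} - \tau_\nu)$$
$$\le {}_n p_{ii} + \epsilon (1 - {}_n p_{ii}).$$

According to the definition of a separable process (see II §2) ${}_n p_{ii}$ can be
made to approximate $f(\alpha)$ arbitrarily closely by choosing the $\tau_j$'s properly.
Hence the preceding inequality implies

$$1 - \epsilon \le f(\alpha) + \epsilon [1 - f(\alpha)],$$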




and $\epsilon$ can be made arbitrarily small in this inequality by choosing $\alpha$ small.
This inequality excludes the possibility that $f(\alpha) = 0$ if we take $\epsilon < \tfrac{1}{2}$.
Then $f(\cdot)$ must be the exponential function written above, and we have
also proved that, for any $\epsilon > 0$, if $t$ is sufficiently small,

$$p_{ii}(t) \le e^{-q_i t} + \epsilon (1 - e^{-q_i t}).$$

Combining this inequality with the inequality

$$p_{ii}(t) \ge f(t) = e^{-q_i t},$$

we find

$$\frac{1 - e^{-q_i t}}{t} \ge \frac{1 - p_{ii}(t)}{t} \ge (1 - \epsilon)\, \frac{1 - e^{-q_i t}}{t},$$

and this implies (1.9). Finally we find from (1.9) that, if

$$P\{x(t_0, \omega) = i\} > 0, \qquad h_2 \ge 0, \qquad \text{and} \quad 0 \le h_1 \le t_0,$$

then

$$P\{x(\tau, \omega) = i,\ t_0 - h_1 \le \tau \le t_0 + h_2 \mid x(t_0, \omega) = i\}
 = \frac{P\{x(t_0 - h_1, \omega) = i\}}{P\{x(t_0, \omega) = i\}}\, e^{-q_i (h_1 + h_2)}.$$

If $h_1 \downarrow 0$, $h_2 \downarrow 0$, the quantity on the right goes to 1, and this fact means
that, if $x(t_0, \omega) = i$, then $x(t, \omega) = i$ for $t$ near $t_0$ except possibly for a set
of sample functions of probability 0. The sample functions are therefore
almost all continuous at $t_0$, as was to be proved.

Note that, if the existence of the limit $q_i$ in (1.9) is made a hypothesis,
the probability (1.10) can be evaluated at once, as follows. According
to the definition of a separable stochastic process (II §2) there is a sequence
$\{t_j\}$ in the interval $[t_0, t_0 + \alpha]$ such that the two $\omega$ sets

$$\{x(t_j, \omega) = i,\ j \ge 1\}, \qquad \{x(t, \omega) = i,\ t_0 \le t \le t_0 + \alpha\}$$

differ by at most a set of probability 0. Moreover, $\{t_j\}$ can be taken as
any sequence dense in the interval $[t_0, t_0 + \alpha]$ because of the continuity
condition (1.4'). In particular, if we take $\{t_j\}$ as the sequence of all
points of the form $t_0 + k\alpha/2^n$, $k = 0, \cdots, 2^n$, $n = 1, 2, \cdots$, we find that

$$P\{x(\tau, \omega) = i,\ t_0 \le \tau \le t_0 + \alpha\}
 = \lim_{n \to \infty} P\{x(t_0 + k\alpha/2^n, \omega) = i,\ k = 0, \cdots, 2^n\}$$
$$ = P\{x(t_0, \omega) = i\} \lim_{n \to \infty} p_{ii}(\alpha/2^n)^{2^n}
 = P\{x(t_0, \omega) = i\}\, e^{-q_i \alpha},$$

and this is equivalent to (1.10).
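The limit $p_{ii}(\alpha/2^n)^{2^n} \to e^{-q_i\alpha}$ used in this evaluation is easy to observe numerically. The following is a small sketch under stated assumptions: it uses the closed-form $p_{11}(t)$ of a two-state chain with exit rates `a` and `b` (so that $q_1 = a$); the chain, the rates, and the function names are illustrative and not taken from the text.

```python
import math

def p11(t, a, b):
    """Closed-form probability of being in state 1 at time t, starting there (two states)."""
    return (b + a * math.exp(-(a + b) * t)) / (a + b)

def dyadic_stay_probability(alpha, n, a, b):
    """Probability of being in state 1 at all 2^n + 1 dyadic points of [0, alpha]."""
    return p11(alpha / 2 ** n, a, b) ** (2 ** n)

a, b, alpha = 2.0, 3.0, 1.0          # hypothetical rates and interval length
limit = math.exp(-a * alpha)         # e^{-q_1 alpha} with q_1 = a
approximations = {n: dyadic_stay_probability(alpha, n, a, b) for n in (1, 5, 10, 15, 20)}
gaps = {n: abs(value - limit) for n, value in approximations.items()}
```

The gap shrinks roughly in proportion to the mesh $\alpha/2^n$, because returns to the initial state between neighboring dyadic points become negligible as the partition is refined.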
A function $g(\cdot)$ will be called a step function if it has only finitely many
points of discontinuity in every finite closed interval, if it is identically




constant in every open interval of continuity points, and if, when $t_0$ is a
point of discontinuity,

$$g(t_0 -) \le g(t_0) \le g(t_0 +) \quad \text{or} \quad g(t_0 -) \ge g(t_0) \ge g(t_0 +). \tag{1.11}$$

A function $g(\cdot)$ will be said to have a jump at a point $t_0$ if it is
discontinuous there, but if the one-sided limits $g(t_0 -)$ and $g(t_0 +)$ exist and
satisfy one of the two preceding inequalities. The discontinuities of a
step function in the interior of its interval of definition are jumps. A
function of $t$ taking on only finitely many values and continuous except
for jumps is a step function.

We shall prove that almost all sample functions of a Markov chain of 
the type we are considering are step functions (if the process is separable). 
Theorem 1.2 shows that the probability of a discontinuity at any one 
point is 0. The following theorem goes considerably further. 

THEOREM 1.3 Let $[p_{ij}(\cdot)]$ be a stationary Markov transition matrix
function satisfying (1.4).

(i) The limits

$$\lim_{t \to 0} \frac{p_{ij}(t)}{t} = q_{ij}, \qquad i \ne j, \tag{1.12}$$

exist and

$$\sum_{j \ne i} q_{ij} = q_i. \tag{1.13}$$

(ii) Let $\{x(t), 0 \le t < \infty\}$ be a separable process determined by $[p_{ij}(\cdot)]$
together with an initial probability distribution. If $q_i > 0$ and if $x(t_0, \omega) = i$,
there is with probability 1 a sample function discontinuity for some $t > t_0$,
and in fact a first discontinuity, which is a jump; if $0 < \alpha \le \infty$, the
probability that if there is a discontinuity in the interval $[t_0, t_0 + \alpha)$ the first
is a jump to $j$ is $q_{ij}/q_i$.

Let $\{x(t), 0 \le t < \infty\}$ be a separable process as described in (ii). It is
no restriction on the transition matrix function to assume the existence
of the $x(t)$ process. In proving (ii) it is no restriction to assume that
$t_0 = 0$, and we shall do so.

If $q_i = 0$, (1.10) shows that, whenever $x(0, \omega) = i$, $x(t, \omega) = i$ for $t > 0$,
so that $p_{ii}(t) \equiv 1$. The theorem is thus true in this case, and we shall
assume $q_i > 0$ from now on. Let $\delta$, $t$ be any positive numbers and let
$m\delta$ be the smallest multiple of $\delta$ which is $\ge t$. The probability $p_{ij}(m\delta)$,
$j \ne i$, is at least equal to the probability that, if $x(0, \omega) = i$, then
$x(m\delta, \omega) = j$ and the first change of $x(\mu\delta, \omega)$ as the integer $\mu$ increases is
a transition from $i$ to $j$; analytically

$$p_{ij}(m\delta) \ge \sum_{\mu=0}^{m-1} p_{ii}(\delta)^{\mu}\, p_{ij}(\delta)\, p_{jj}[(m - \mu - 1)\delta].$$




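If $t$ is so small that $p_{jj}(s) > 1 - \epsilon$ for $s \le t$, this inequality becomes

$$p_{ij}(m\delta) \ge (1 - \epsilon)\, \frac{1 - p_{ii}(\delta)^m}{1 - p_{ii}(\delta)}\, p_{ij}(\delta).$$

When $\delta \to 0$ this implies, by Theorem 1.2,

$$p_{ij}(t) \ge (1 - \epsilon)\, \frac{1 - e^{-q_i t}}{q_i}\, \limsup_{\delta \to 0} \frac{p_{ij}(\delta)}{\delta}.$$

The limit superior on the right must therefore be finite. We now divide
both sides by $t$ and obtain, when $t \to 0$, using the fact that $\epsilon$ can be made
arbitrarily small,

$$\liminf_{t \to 0} \frac{p_{ij}(t)}{t} \ge \limsup_{\delta \to 0} \frac{p_{ij}(\delta)}{\delta}.$$

Then the limit $q_{ij}$ in (1.12) exists, and (1.13) follows from the equality

$$\sum_{j \ne i} \frac{p_{ij}(t)}{t} = \frac{1 - p_{ii}(t)}{t}.$$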
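We prove now that for any $\alpha > 0$ the first discontinuity if any of a sample
function for $0 \le t \le \alpha$ is a jump. Fix $i$, take the initial condition
$x(0, \omega) = i$, and consider the $\omega$ set $\Lambda_{n,\delta}^{(j)}$ for which, for some $\nu$,
$2 \le \nu \le n$,

$$x(\tau, \omega) = i, \quad 0 \le \tau \le \frac{(\nu - 1)\alpha}{n}; \qquad
  x(\tau, \omega) = j, \quad \frac{\nu\alpha}{n} \le \tau \le \frac{\nu\alpha}{n} + \delta.$$

Then

$$P\{\Lambda_{n,\delta}^{(j)} \mid x(0, \omega) = i\}
 = \sum_{\nu=2}^{n} e^{-q_i (\nu - 1)\alpha/n}\, p_{ij}(\alpha/n)\, e^{-q_j \delta}$$
$$ = \frac{e^{-q_i \alpha/n} - e^{-q_i \alpha}}{1 - e^{-q_i \alpha/n}}\, p_{ij}(\alpha/n)\, e^{-q_j \delta}
 \to (1 - e^{-q_i \alpha})\, \frac{q_{ij}}{q_i}\, e^{-q_j \delta}, \qquad n \to \infty.$$

It follows that, if $\Lambda_\delta^{(j)}$ is the set of points $\omega$ in $\Lambda_{n,\delta}^{(j)}$ for infinitely many
values of $n$, that is,

$$\Lambda_\delta^{(j)} = \limsup_{n \to \infty} \Lambda_{n,\delta}^{(j)},$$

then

$$P\{\Lambda_\delta^{(j)} \mid x(0, \omega) = i\} \ge (1 - e^{-q_i \alpha})\, \frac{q_{ij}}{q_i}\, e^{-q_j \delta}.$$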




The set $\Lambda_\delta^{(j)}$ increases when $\delta$ decreases. Hence, if $\Lambda^{(j)}$ is the set of
points $\omega$ in $\Lambda_\delta^{(j)}$ for some $\delta > 0$,

$$P\{\Lambda^{(j)} \mid x(0, \omega) = i\} \ge (1 - e^{-q_i \alpha})\, \frac{q_{ij}}{q_i}.$$

Any sample function corresponding to an $\omega$ in $\Lambda^{(j)}$ is identically $i$ in some
interval with left-hand endpoint at $t = 0$, has a discontinuity at $\tau_1 \le \alpha$,
and is identically $j$ in some open interval with left-hand endpoint $\tau_1$.
Since

$$\sum_j P\{\Lambda^{(j)} \mid x(0, \omega) = i\} \ge 1 - e^{-q_i \alpha} = P\{x(\tau, \omega) \not\equiv i,\ 0 \le \tau \le \alpha\},$$

there must be equality in this inequality, and therefore also in the preceding
one. Thus excluding an $\omega$ set of probability 0, $x(t, \omega) = i$ for $t < \tau_1$
and $x(t, \omega) = j$ for $\tau_1 < t < \tau_1 + \delta$, where $j$ and $\delta$ depend on $\omega$, and,
on the assumption that there is a discontinuity in the interval $(0, \alpha)$, the
probability that $x[\tau_1(\omega) +, \omega] = j$ is $q_{ij}/q_i$. If $q_i = 0$, and if $x(0, \omega) = i$,
the probability of a discontinuity in the interval $(0, \alpha)$ is 0 for all $\alpha$, as
we have already noted above. If $q_i > 0$, and if $x(0, \omega) = i$, the probability
is $1 - e^{-q_i \alpha}$ that there is a discontinuity in the interval $(0, \alpha)$, that is,

$$P\{\tau_1(\omega) < \alpha \mid x(0, \omega) = i\} = 1 - e^{-q_i \alpha}, \qquad \alpha \ge 0.$$


We have not yet proved that the discontinuity at $\tau_1(\omega)$ is a jump for
almost all $\omega$ because we have not verified that the condition (1.11) is
satisfied at $\tau_1(\omega)$. This not very important fact is a consequence of the
following argument. By the definition of separability of a stochastic
process (II §2) there is a denumerable $t$ set with the property that,
neglecting an $\omega$ set of probability 0, $x(\cdot, \omega)$ has the same least upper and
greatest lower bounds on every open $t$ interval as on the part of this
denumerable $t$ set in the interval. Since the probability of a discontinuity
at any point of this denumerable set is 0, $x[\tau_1(\omega), \omega]$ must lie
between $x[\tau_1(\omega) -, \omega]$ and $x[\tau_1(\omega) +, \omega]$ inclusive, that is, the discontinuity
at $\tau_1(\omega)$ is a jump, with probability 1. This fact is a more or less
accidental result of our definition of separability of a stochastic process.
If the process is supposed separable relative to the closed sets, it follows
that the sample functions are even almost all continuous on the right or
left at $\tau_1(\omega)$.

THEOREM 1.4 The sample functions of a separable Markov chain (with 
a finite number of states) which has stationary transition probabilities 
satisfying the continuity condition (1.4), are almost all step functions. 

We shall suppose in the proof that there is a stationary Markov tran- 
sition matrix function which together with an initial probability distribu- 
tion determines the process. This is a slight restriction since the transition 




probability $P\{x(t, \omega) = j \mid x(s, \omega) = i\}$ may not be defined for all $i, j$, but
it will be obvious what reservations to make in a few statements to make 
the proof perfectly general. 

Let $\{x(t), 0 \le t < \infty\}$ be the process in question. We have seen that
(neglecting a set of sample functions of probability 0), if $x(0, \omega) = i$ and
if $q_i = 0$, then $x(t, \omega) \equiv i$, whereas, if $q_i > 0$, there is a first discontinuity,
a jump, at a point $\tau_1(\omega)$. If $q_i = 0$ set $\tau_1(\omega) = \tau_2(\omega) = \cdots = \infty$.
Continuing the argument, it is easily proved that, if $x(\tau_1 +, \omega) = j$ and
if $q_j = 0$, then $x(t, \omega) = j$ for $t > \tau_1(\omega)$, in which case we define
$\tau_2(\omega) = \tau_3(\omega) = \cdots = \infty$, whereas, if $q_j > 0$, there is a first discontinuity
after $\tau_1(\omega)$, a jump, at a point $\tau_2(\omega)$, with

$$P\{\tau_2(\omega) - \tau_1(\omega) \ge \alpha \mid x[\tau_1(\omega) +, \omega] = j\} = e^{-q_j \alpha},$$

and so on. In general we have not only

$$P\{\tau_{n+1}(\omega) - \tau_n(\omega) \ge \alpha \mid x[\tau_n(\omega) +, \omega] = j\} = e^{-q_j \alpha}$$

(in case $q_j > 0$), but to the condition $x[\tau_n(\omega) +, \omega] = j$ we can add other
conditions which involve sample functions at parameter values $\le \tau_n(\omega)$
without changing the exponential function on the right. This fact
explains why the procedure can be continued indefinitely. To prove the
theorem we show that $\lim_{n \to \infty} \tau_n = +\infty$ with probability 1. Let
$q = \max_j q_j$. Then $\{\tau_n, n \ge 1\}$ is a sequence of random variables with

$$P\{\tau_{n+1}(\omega) - \tau_n(\omega) \ge \alpha\}
 = \sum_{j=1}^{N} P\{\tau_{n+1}(\omega) - \tau_n(\omega) \ge \alpha \mid x[\tau_n(\omega) +, \omega] = j\}\, P\{x[\tau_n(\omega) +, \omega] = j\}$$
$$ = \sum_{j=1}^{N} e^{-q_j \alpha}\, P\{x[\tau_n(\omega) +, \omega] = j\}
 \ge e^{-q\alpha}, \qquad n \ge 0, \quad \alpha \ge 0.$$

Here we have put $\tau_0(\omega) = 0$, and interpret $\infty - c$ as $\infty$ for $c \le \infty$.
Then infinitely many differences $\tau_{n+1}(\omega) - \tau_n(\omega)$ will be $\ge \alpha$, with
probability $\ge e^{-q\alpha}$, so that $\lim_{n \to \infty} \tau_n(\omega) = \infty$ with probability $\ge e^{-q\alpha}$ for
every $\alpha > 0$. This probability is then 1, as was to be proved.

Let ${}_1p_{ik}(t)$ be the probability that, if $x(t_0, \omega) = i$, then $x(t_0 + t, \omega) = k$
and that the transition from $i$ to $k$ has been accomplished in a single step.
One is tempted to evaluate ${}_1p_{ik}(t)$ by stating that $x(\tau, \omega)$ must be identically
$i$ to the jump point $s$ (probability $e^{-q_i s}$), that there must then be a jump




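to $k$ (probability $q_{ik}\, ds$), and that then $x(\tau, \omega)$ must be identically $k$ to $t$
(probability $e^{-q_k (t - s)}$), so that

$${}_1p_{ik}(t) = \begin{cases}
\displaystyle \int_0^t e^{-q_i s}\, q_{ik}\, e^{-q_k (t - s)}\, ds, & k \ne i, \\[6pt]
0, & k = i.
\end{cases}$$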


This reasoning will now be justified in full detail, but similar reasoning
will be used below without further justification. Take $t_0 = 0$ again, fix
$i$ and $k \ne i$, let $M_n$ be the $\omega$ set for which, for some $\nu$, $0 \le \nu \le n - 1$,

$$x(\tau, \omega) = \begin{cases} i, & 0 \le \tau \le \dfrac{\nu t}{n}, \\[6pt]
k, & \dfrac{(\nu + 1) t}{n} \le \tau \le t, \end{cases}$$

and let $M$ be the set for which, for some $s$, $0 < s < t$,

$$x(\tau, \omega) = \begin{cases} i, & 0 \le \tau < s, \\ k, & s < \tau \le t. \end{cases}$$

Then $M$ is the set of points $\omega$ in infinitely many $M_n$'s as well as the set of
points $\omega$ in all but a finite number of $M_n$'s, that is, in the usual language
of set theory $\lim_{n \to \infty} M_n = M$, and it follows that

$$\lim_{n \to \infty} P\{M_n \mid x(0, \omega) = i\} = P\{M \mid x(0, \omega) = i\} = {}_1p_{ik}(t).$$

On the other hand

$$P\{M_n \mid x(0, \omega) = i\} = \sum_{\nu=0}^{n-1} e^{-q_i \nu t/n}\, p_{ik}(t/n)\, e^{-q_k (n - \nu - 1) t/n}$$

and when $n \to \infty$ this sum approaches the integral evaluation tentatively
obtained above for ${}_1p_{ik}(t)$, an evaluation which is thereby justified.
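The passage from the finite sum to the integral can be illustrated numerically. The sketch below is an approximation under stated assumptions, not the text's own computation: it replaces $p_{ik}(t/n)$ in the sum by its first-order value $q_{ik}\, t/n$ from (1.12), and compares the result with the elementary closed form of $\int_0^t e^{-q_i s} q_{ik} e^{-q_k(t-s)}\, ds$. The rates and function names are hypothetical.

```python
import math

def one_jump_closed_form(t, q_i, q_k, q_ik):
    """Integral of e^{-q_i s} q_ik e^{-q_k (t - s)} over 0 <= s <= t, evaluated exactly."""
    if q_i == q_k:
        return q_ik * t * math.exp(-q_i * t)
    return q_ik * (math.exp(-q_k * t) - math.exp(-q_i * t)) / (q_i - q_k)

def one_jump_sum(t, q_i, q_k, q_ik, n):
    """Finite sum over nu = 0..n-1, with p_ik(t/n) replaced by q_ik * t/n (first order)."""
    h = t / n
    return sum(math.exp(-q_i * nu * h) * (q_ik * h) * math.exp(-q_k * (n - nu - 1) * h)
               for nu in range(n))

t, q_i, q_k, q_ik = 1.0, 2.0, 1.0, 1.5   # hypothetical rates with q_ik <= q_i
exact = one_jump_closed_form(t, q_i, q_k, q_ik)
gaps = {n: abs(one_jump_sum(t, q_i, q_k, q_ik, n) - exact) for n in (10, 100, 1000, 10000)}
```

The gap decreases roughly like $1/n$, consistent with the sum being a Riemann sum for the integral.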

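Let ${}_np_{ik}(t)$ be the probability that, if $x(t_0, \omega) = i$, then $x(t_0 + t, \omega) = k$,
and the transition from $i$ to $k$ has been accomplished in $n$ steps. We have
already evaluated ${}_1p_{ik}(t)$. In the same way we derive equations for
${}_{n+1}p_{ik}(t)$ in terms of ${}_np_{jk}(t)$, and thereby obtain an inductive definition of
the sequence,

$${}_0p_{ik}(t) = \begin{cases} 0, & k \ne i, \\ e^{-q_i t}, & k = i; \end{cases}
\qquad
{}_{n+1}p_{ik}(t) = \sum_{j \ne i} \int_0^t e^{-q_i s}\, q_{ij}\; {}_np_{jk}(t - s)\, ds, \quad n \ge 0, \tag{1.14}$$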
ii 4 




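or alternatively

$${}_0p_{ik}(t) = \begin{cases} 0, & k \ne i, \\ e^{-q_i t}, & k = i; \end{cases}
\qquad
{}_{n+1}p_{ik}(t) = \sum_{j \ne k} \int_0^t {}_np_{ij}(s)\, q_{jk}\, e^{-q_k (t - s)}\, ds, \quad n \ge 0. \tag{1.14'}$$

Since almost all sample functions are step functions, it follows that

$$p_{ik}(t) = \sum_{n=0}^{\infty} {}_np_{ik}(t). \tag{1.15}$$

If (1.14) and (1.15) are combined, we find

$$p_{ik}(t) = \delta_{ik}\, e^{-q_i t} + \sum_{j \ne i} \int_0^t e^{-q_i s}\, q_{ij}\, p_{jk}(t - s)\, ds, \tag{1.16}$$

and, if (1.14') and (1.15) are combined, we find

$$p_{ik}(t) = \delta_{ik}\, e^{-q_i t} + \sum_{j \ne k} \int_0^t p_{ij}(s)\, q_{jk}\, e^{-q_k (t - s)}\, ds. \tag{1.16'}$$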


These two integral equations are so important that we derive them
independently to exhibit their probabilistic meaning. The first exhibits
$p_{ik}(t)$ as the probability that, if $x(0, \omega) = i$, either $x(\tau, \omega)$ remains at $i$ for
time $t$ (possible only if $k = i$, in that case probability $e^{-q_i t}$) or else
$x(\tau, \omega)$ will remain at $i$ to time $s$ (probability $e^{-q_i s}$), will then jump to some
state $j \ne i$ in time $ds$ (probability $q_{ij}\, ds$), and will go to $k$ in time $t - s$
[probability $p_{jk}(t - s)$]. Thus (1.16) depends essentially on the fact that
almost all sample functions have a first discontinuity, which is a jump.
Similarly (1.16') depends essentially on the existence of a last discontinuity,
which is a jump, in the interval $[0, t]$.

Equations (1.16) and (1.16') show that the elements of a stationary
Markov transition matrix function satisfying the continuity condition
(1.4) have continuous derivatives. Taking derivatives in these equations,
we obtain (1.7) and (1.7') respectively, obtained earlier by direct
differentiation of the Chapman-Kolmogorov equation. The differential equations
(1.7) and (1.7') can thus be interpreted in terms of the continuity properties
of the sample functions. This interpretation becomes critically important
in its generalized form for the processes to be discussed in §2.

Equations (1.14) or (1.14'), and (1.15), provide an explicit algorithm for
calculating $p_{ik}(t)$ in terms of the $q_i$'s and $q_{ij}$'s which has more probabilistic
significance than that given above, $P(t) = e^{tQ}$, in the course of the solution
of the systems of differential equations (1.7) and (1.7').
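The algorithm described by (1.14) and (1.15) can be carried out approximately on a time grid. The following is a rough numerical sketch, not the text's procedure in any exact sense: the integrals in (1.14) are replaced by the trapezoidal rule on a uniform grid, the series (1.15) is cut off after `max_jumps` terms, and the result is compared with the closed-form transition function of a two-state chain with exit rates `a` and `b`. Grid size, cutoff, rates, and all names are hypothetical choices.

```python
import math

def series_transition(q, t, steps=100, max_jumps=40):
    """Approximate P(t) by summing n-jump probabilities (1.14) as in (1.15)."""
    n = len(q)
    h = t / steps
    rate = [-q[i][i] for i in range(n)]
    decay = [[math.exp(-rate[i] * m * h) for m in range(steps + 1)] for i in range(n)]
    # level[i][k][m]: probability of going i -> k in exactly the current number of
    # jumps during time m*h; the zero-jump level is delta_ik * exp(-q_i * m * h).
    level = [[[decay[i][m] if i == k else 0.0 for m in range(steps + 1)]
              for k in range(n)] for i in range(n)]
    total = [[level[i][k][steps] for k in range(n)] for i in range(n)]
    for _ in range(max_jumps):
        # inflow[i][k][m] = sum over j != i of q_ij * level[j][k][m]
        inflow = [[[sum(q[i][j] * level[j][k][m] for j in range(n) if j != i)
                    for m in range(steps + 1)] for k in range(n)] for i in range(n)]
        nxt = [[[0.0] * (steps + 1) for _ in range(n)] for _ in range(n)]
        for i in range(n):
            for k in range(n):
                for m in range(1, steps + 1):
                    # trapezoidal rule for the integral over s in [0, m*h]
                    acc = 0.5 * (decay[i][0] * inflow[i][k][m]
                                 + decay[i][m] * inflow[i][k][0])
                    acc += sum(decay[i][r] * inflow[i][k][m - r] for r in range(1, m))
                    nxt[i][k][m] = acc * h
        level = nxt
        for i in range(n):
            for k in range(n):
                total[i][k] += level[i][k][steps]
    return total

a, b, t = 1.0, 2.0, 1.0              # hypothetical two-state example
Q = [[-a, a], [b, -b]]
e = math.exp(-(a + b) * t)
exact = [[(b + a * e) / (a + b), (a - a * e) / (a + b)],
         [(b - b * e) / (a + b), (a + b * e) / (a + b)]]
approx = series_transition(Q, t)
gap = max(abs(approx[i][k] - exact[i][k]) for i in range(2) for k in range(2))
```

Each pass of the loop adds the contribution of sample functions with one more jump, which is the probabilistic content of (1.15); the remaining gap comes from the quadrature and the cutoff.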




Suppose now that $q_1, \cdots, q_N$; $q_{ij}$, $i, j = 1, \cdots, N$ $(i \ne j)$ are any
constants satisfying (1.6). We have already proved that there is a
stationary Markov transition matrix function [satisfying the continuity
condition (1.4)] with these $q_i$'s and $q_{ij}$'s, that is, satisfying (1.9) and (1.12).
In fact, we showed above that $e^{tQ}$ is such a matrix function. A second
method is to define $p_{ik}(t)$ by (1.14) and (1.15), or (1.14') and (1.15), and
it is in fact not difficult to prove directly that the series in (1.15) converges,
and defines a stationary Markov transition matrix function. (An indirect
proof would consist simply of the remark that there is a unique stationary
Markov transition matrix function, satisfying the continuity condition
(1.4), with the given $q_i$'s and $q_{ij}$'s, namely $e^{tQ}$, and we have already
deduced that (1.15) is then true.) A third method, now to be given, is
important in that it defines the matrix function by defining the corresponding
Markov chains in terms of their sample functions, and thus
proves an existence theorem for the systems of differential equations
(1.7) and (1.7') by purely probabilistic methods which are applicable to
the considerably more complicated integral-differential systems considered
in §2. To construct a Markov chain with given $q_i$'s and $q_{ij}$'s we simply
adapt the proof of Theorem 1.3, as follows.

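Let $z_1$ be any random variable taking on the values $1, \cdots, N$ only.
Let $\tau_1$ be a positive random variable whose joint distribution with $z_1$ is
determined by setting

$$P\{\tau_1(\omega) \ge \alpha \mid z_1(\omega) = i\} = e^{-q_i \alpha}, \qquad \alpha \ge 0.$$

(If $q_i = 0$ the interpretation is the obvious one that $\tau_1(\omega) = \infty$.) If
$z_1, \cdots, z_n$, $\tau_1, \cdots, \tau_n$ have already been defined, $z_{n+1}$ is a random
variable assuming the values $1, \cdots, N$ only, whose joint distribution with
$z_1, \cdots, z_n$, $\tau_1, \cdots, \tau_n$ is determined by setting

$$P\{z_{n+1}(\omega) = j \mid \tau_1, \cdots, \tau_n, z_1, \cdots, z_n\} = \frac{q_{ij}}{q_i} \qquad (z_n(\omega) = i),$$

and $\tau_{n+1}$ is a positive random variable whose joint distribution with
$z_1, \cdots, z_{n+1}$, $\tau_1, \cdots, \tau_n$ is determined by

$$P\{\tau_{n+1}(\omega) - \tau_n(\omega) \ge \alpha \mid \tau_1, \cdots, \tau_n, z_1, \cdots, z_{n+1}\} = e^{-q_j \alpha}
 \qquad (z_{n+1}(\omega) = j,\ \alpha \ge 0).$$

We have assumed here that, if $z_n(\omega) = i$, then $q_i \ne 0$. To complete the
definition we define

$$P\{z_{n+1}(\omega) = i \mid \tau_1, \cdots, \tau_n, z_1, \cdots, z_n\} = 1 \qquad (z_n(\omega) = i,\ q_i = 0),$$

$$P\{\tau_n(\omega) = \infty \mid \tau_1, \cdots, \tau_{n-1}, z_1, \cdots, z_n\} = 1 \qquad (z_n(\omega) = i,\ q_i = 0).$$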




A sequence of random variables $z_1, \tau_1, z_2, \tau_2, \cdots$ has thus been defined
inductively, and this will of course be identified with the sequence obtained
in the proof of Theorem 1.4 with $x(\tau_n +) = z_{n+1}$. To do this we note
first that $\lim_{n \to \infty} \tau_n = \infty$ with probability 1, using the proof of the
corresponding fact in Theorem 1.3, so that if we take $\tau_0 = 0$ we can define
$x(t)$ by

$$x(t, \omega) = z_n(\omega), \qquad \tau_{n-1}(\omega) \le t < \tau_n(\omega),$$

for $0 \le t < \infty$. The following argument shows that the $x(t)$ process
defined in this way is a Markov chain, and that the given $q_i$'s and $q_{ij}$'s
satisfy (1.9) and (1.12). We shall use the fact that, if $c$ is a positive
constant and if $x$ is a positive random variable with density $c e^{-c\lambda}$ ($\lambda > 0$),
and if $\xi$ is any positive number,

$$P\{x(\omega) > \lambda\} = P\{x(\omega) - \xi > \lambda \mid x(\omega) - \xi > 0\}.$$


This trivial fact means that, if in the procedure described above for
constructing the general sample function we choose some $s > 0$ and stop
the procedure when a $\tau_j$ is reached which exceeds $s$, so that $x(t)$ is only
defined for $t \le s$, and if then the defining procedure is recommenced at
$t = s$ in exactly the same way as it was started at $t = 0$, using, however,
the $x(s)$ values as the initial values with the probabilities already found
for these values, then the following $\tau_j$'s and $z_j$'s will have the same
distribution as they would have had if the procedure had not been
interrupted. But then

(a) the conditional probability distribution of $x(t)$, for prescribed values
of $x(\tau)$ when $\tau \le s$, depends only on $x(s)$, that is, the process is a Markov
process;

(b) the probability $P\{x(t, \omega) = j \mid x(s, \omega) = i\}$ is a function of $(t - s)$,
that is, the process has stationary transition probabilities.
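The fact about the density $c e^{-c\lambda}$ used in this argument is the lack of memory of the exponential distribution. A minimal Python check, with hypothetical function names and arbitrarily chosen numerical values, compares the two sides of the identity directly from the tail $P\{x > \lambda\} = e^{-c\lambda}$.

```python
import math

def tail(c, lam):
    """P{x > lam} for a positive random variable with density c*exp(-c*lam)."""
    return math.exp(-c * lam)

def conditional_tail(c, lam, xi):
    """P{x - xi > lam | x - xi > 0} = P{x > lam + xi} / P{x > xi}."""
    return tail(c, lam + xi) / tail(c, xi)

if __name__ == "__main__":
    c, lam, xi = 2.0, 0.7, 1.3   # illustrative values only
    print(tail(c, lam), conditional_tail(c, lam, xi))
```

Both printed numbers agree up to rounding, whatever positive values are chosen for `c`, `lam` and `xi`.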

The continuity condition (1.4) is obviously satisfied and, since the $q_i$'s
and $q_{ij}$'s have exactly the sample function significance of the $q_i$'s and
$q_{ij}$'s of Theorems 1.1 and 1.2, these sets of constants can be identified
with each other. This completes the discussion of the construction of a
chain with given $q_i$'s and $q_{ij}$'s. We remark that the corresponding
analytical calculations of $p_{ij}(t)$ would be by means of (1.14) and (1.15)
or (1.14') and (1.15), using the fact that

$${}_np_{ik}(t) = P\{z_{n+1}(\omega) = k,\ \tau_n(\omega) \le t < \tau_{n+1}(\omega) \mid z_1(\omega) = i\}.$$


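The relation between the differential equation system (1.7) and the
adjoint system (1.7') shows up more clearly in the non-stationary case,
about which we shall therefore make a few remarks. We shall neither
maximize the rigor nor minimize the hypotheses in the following. Suppose
that $[p_{ij}(\cdot, \cdot)]$ is a Markov transition matrix function and that there
are functions $q_i(\cdot)$, $q_{ij}(\cdot)$ for which

$$p_{ij}(s, t) = \begin{cases} 1 - q_i(s)(t - s) + o(t - s), & i = j, \\ q_{ij}(s)(t - s) + o(t - s), & i \ne j, \end{cases}$$

so that, by (1.1),

$$q_i(t) \ge 0, \qquad q_{ij}(t) \ge 0, \qquad q_i(t) = \sum_{j \ne i} q_{ij}(t).$$

Then

$$\left. \frac{\partial p_{ij}(s, t)}{\partial s} \right|_{s = t} = \begin{cases} q_i(t), & i = j, \\ -q_{ij}(t), & i \ne j, \end{cases}$$

and

$$\left. \frac{\partial p_{ij}(s, t)}{\partial t} \right|_{t = s} = \begin{cases} -q_i(s), & i = j, \\ q_{ij}(s), & i \ne j. \end{cases}$$

Taking partial derivatives in the Chapman-Kolmogorov system of equations
(1.2) with respect to $s$, setting $s = t$, and then finally replacing the
pair $(t, u)$ by the pair $(s, t)$, we find

$$(1.17) \qquad \frac{\partial p_{ik}(s, t)}{\partial s} = q_i(s)\, p_{ik}(s, t) - \sum_{j \ne i} q_{ij}(s)\, p_{jk}(s, t).$$

Taking partial derivatives in (1.2) with respect to $u$ and setting $u = t$, we
find

$$(1.17') \qquad \frac{\partial p_{ik}(s, t)}{\partial t} = -p_{ik}(s, t)\, q_k(t) + \sum_{j \ne k} p_{ij}(s, t)\, q_{jk}(t).$$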
JFL 


The system (1.17) is called the backward system because it involves
differentiation with respect to the earlier time $s$, and the system (1.17') is
called the forward system because it involves differentiation with respect
to the later time $t$. In the stationary case we have seen that the backward
system (1.7) involves in an essential way the first sample function jump
after a given time $t_0$ (before time $t_0 + t$) while the forward system (1.7')
involves the last sample function jump (after time $t_0$) before time $t_0 + t$.
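For constant $q_i$, $q_{ij}$ the two systems can be compared numerically. The sketch below is an illustration only, not the method of the text: it assumes the standard fact that in the stationary finite case the transition matrix is the matrix exponential $P(u) = e^{uQ}$ of the matrix $Q$ with entries $-q_i$ on the diagonal and $q_{ij}$ off it, so that with $u = t - s$ the backward system reads $P'(u) = QP(u)$ and the forward system reads $P'(u) = P(u)Q$. The helper names and the example rates are hypothetical.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(Q, t, terms=80):
    """exp(tQ) by its Taylor series; adequate for small matrices and moderate t."""
    n = len(Q)
    tQ = [[t * Q[i][j] for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, tQ)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def derivative(Q, t, h=1e-5):
    """Central-difference approximation to dP/dt."""
    plus, minus = mat_exp(Q, t + h), mat_exp(Q, t - h)
    n = len(Q)
    return [[(plus[i][j] - minus[i][j]) / (2 * h) for j in range(n)] for i in range(n)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

if __name__ == "__main__":
    # Rows sum to zero: diagonal entries are -q_i, off-diagonal entries q_ij.
    Q = [[-2.0, 1.0, 1.0], [0.5, -1.0, 0.5], [3.0, 0.0, -3.0]]
    P, dP = mat_exp(Q, 0.8), derivative(Q, 0.8)
    print(max_abs_diff(dP, mat_mul(Q, P)))   # backward form
    print(max_abs_diff(dP, mat_mul(P, Q)))   # forward form
```

Both printed discrepancies are of the size of the finite-difference error, which reflects that for finitely many states the two systems have the same solution.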

We have made essential use throughout this section of the hypothesis
that the chains we have considered have only a finite number $N$ of states.
The situation is considerably more complicated if there are enumerably
many states. In this case, as we shall see in the next section, it is
characteristic that for a wide class of cases the backward system of differential
equations holds when the forward system does not. This will be related
to the study of sample function discontinuities, which may be worse than
jumps.


2. Generalization of §1 to a continuous state space 


Let $X$ be a linear Borel set. In the present section we consider Markov
processes $\{x(t), 0 \le t < \infty\}$ whose random variables take on values in $X$.
In particular, if $X$ is the set of integers $1, \ldots, N$, the processes are Markov
chains with $N$ states, the processes considered in the preceding section.
The choice of $X$ is not critical, in the sense that any larger set will serve
as well, and the fact that $X$ contains more points than necessary will do
no harm. For example, if $X$ is the whole line, but if

$$\sum_{j=1}^{N} P\{x(t, \omega) = j\} = 1,$$

the results of the preceding section hold, whether or not the random
variables ever actually take on values other than $1, \ldots, N$. If the
stochastic process under consideration is to be supposed separable, we
recall from II §2 that it may be necessary to suppose that $X$ is closed in
the extended infinite line.

The results of this section are valid with no change if $X$ is multidimensional
and with some moderate changes even if $X$ is an abstract space
(with no topology defined on it), but in the latter case there are certain 
measure theoretic problems that would affect the type of argument that 
could be used in the proofs. 

Rather than supposing that a Markov process is given we suppose the 
stochastic transition functions of the process are given, in the following 
form: 

It is supposed that a function $p(\cdot, \cdot; \cdot, \cdot)$ of $s, \xi, t, A$ is given, defined
for $0 \le s < t$, $\xi \in X$, $A$ a Borel subset of $X$, satisfying:

(a) $p(s, \cdot; t, A)$ is a Baire function of $\xi$ for fixed $s, t, A$;

(b) $p(s, \xi; t, \cdot)$ is a probability measure in $A$ for fixed $s, \xi, t$;

(c) $p(\cdot, \cdot; \cdot, \cdot)$ satisfies the Chapman-Kolmogorov equation

$$p(s, \xi; u, A) = \int_X p(t, \eta; u, A)\, p(s, \xi; t, d\eta), \qquad s < t < u.$$

A function $p(\cdot, \cdot; \cdot, \cdot)$ satisfying the above conditions will be called a
Markov transition function. It is convenient to define $p(t, \xi; t, A)$ to be
1 if $\xi \in A$ and 0 otherwise, and we shall use this definition throughout.
If $p(\cdot, \cdot; \cdot, \cdot)$ is a Markov transition function and if $p(\cdot)$ is any probability
distribution of sets $A$, we have seen in II §6 that there is a Markov
process $\{x(t), 0 \le t < \infty\}$ whose random variables take on values in $X$,
for which

$$P\{x(0, \omega) \in A\} = p(A),$$

$$P\{x(t, \omega) \in A \mid x(s)\} = p(s, x(s); t, A),$$

with probability 1. The exceptional $\omega$ set may depend on $s, t, A$. More
generally we shall say that any Markov stochastic process for which the
latter equation is satisfied in the sense just described is one with $p$ as its
transition function.

In the case of stationary transition probabilities, when by definition
$p(s, \xi; t, A)$ depends only on $t - s$, we use the notation

$$p(t - s, \xi, A) = p(s, \xi; t, A),$$

and call $p(\cdot, \cdot, \cdot)$ a stationary Markov transition function.
The Chapman-Kolmogorov equation becomes

$$p(s + t, \xi, A) = \int_X p(t, \eta, A)\, p(s, \xi, d\eta).$$
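When $X$ is a finite set the integral in the stationary Chapman-Kolmogorov equation reduces to a sum over states. As an illustration only, the Python sketch below checks the equation for the two-state chain with rates $q_{12} = a$, $q_{21} = b$, using its well-known closed-form transition matrix; the function names and the numerical values are hypothetical choices, not taken from the text.

```python
import math

def two_state_p(t, a, b):
    """Transition matrix p_ij(t) of the two-state chain with q_12 = a, q_21 = b."""
    s = a + b
    e = math.exp(-s * t)
    return [[(b + a * e) / s, (a - a * e) / s],
            [(b - b * e) / s, (a + b * e) / s]]

def chapman_kolmogorov_gap(s, t, a, b):
    """Largest entry of |P(s+t) - P(s)P(t)|; zero up to rounding."""
    Ps, Pt, Pst = two_state_p(s, a, b), two_state_p(t, a, b), two_state_p(s + t, a, b)
    return max(abs(Pst[i][k] - sum(Ps[i][j] * Pt[j][k] for j in range(2)))
               for i in range(2) for k in range(2))

if __name__ == "__main__":
    print(chapman_kolmogorov_gap(0.3, 1.1, 2.0, 3.0))
```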


In the following it will always be assumed that the transition probabili- 
ties are stationary unless the contrary is explicitly stated. The results 
will include those of §1 as special cases, and §1 was written only to 
illuminate the general case of this section. 

We shall say that Doeblin's condition (D) is satisfied if there is a finite
valued measure $\varphi(\cdot)$ of Borel subsets of $X$, an $\varepsilon > 0$, and an $s > 0$, such
that $p(s, \xi, A) \le 1 - \varepsilon$ if $\varphi(A) \le \varepsilon$. It follows as in the discrete parameter
case (see V §5) that then

$$p(t, \xi, A) \le 1 - \varepsilon, \qquad t \ge s,$$

if $\varphi(A) \le \varepsilon$, so that to every $t_0$ corresponds a $\nu$ such that

$$p(\nu t_0, \xi, A) \le 1 - \varepsilon$$

if $\varphi(A) \le \varepsilon$. Thus each stochastic transition function $p(t_0, \cdot, \cdot)$ satisfies
Doeblin's condition (D) in the sense of V §5.

THEOREM 2.1 If the stationary Markov transition function $p(\cdot, \cdot, \cdot)$
satisfies condition (D), then $\lim_{t \to \infty} p(t, \xi, A)$ exists for all $\xi$ and $A$. The
convergence is uniformly exponentially fast.

The proof of this theorem follows that of Theorem 1.1, which is a
special case, and only the beginning will be sketched to show that, if
$\varphi(\cdot)$ is the measure function of condition (D), and if $N$ is the largest integer
$\le \varphi(X)/\varepsilon$, then $N$ plays the same role here as the number of states played
in the chain case of §1. In fact, we have seen in V §5 that for each $t_0$
either $\lim_{n \to \infty} p(n t_0, \xi, A)$ exists for all $\xi$, $A$ or the stochastic transition
function $p(t_0, \cdot, \cdot)$ determines cyclically moving classes of states and that
because of condition (D) each such class must have $\varphi$ measure at least $\varepsilon$.
If the cycle lengths are $d_1, d_2, \ldots$, we then have $\sum_j d_j \varepsilon \le \varphi(X)$, so that
$\sum_j d_j \le N$. Moreover, we showed that if $\nu$ is an integer divisible by the
$d_j$'s, say $\nu = N!$, $\lim_{n \to \infty} p(n \nu t_0, \xi, A)$ exists for all $\xi$, $A$. Since $t_0$ is arbitrary,
$\lim_{n \to \infty} p(nt, \xi, A)$ exists for all $t$, $\xi$, $A$ and the proof goes on as in the special
case Theorem 1.1.

We shall be interested in the remainder of this section in stochastic
processes whose sample functions are almost all step functions, and in
closely related processes. The appropriate corresponding continuity
condition to impose on the stochastic transition function is

$$(2.1) \qquad \lim_{t \to 0} p(t, \xi, \{\xi\}) = 1$$

for all $\xi$, where $\{\xi\}$ is the point set containing the single point $\xi$. Among
other results Doeblin's result will be proved, that if (2.1) is satisfied
uniformly in $\xi$ the sample functions of a separable process will almost all
be step functions. Without this added uniformity condition the statement
is false in general. If $X$ is finite, then if (2.1) is satisfied at all it is
necessarily satisfied uniformly in $\xi$. In the preceding section we studied the
case in which $X$ consisted of the points $1, \ldots, N$ and wrote $p_{ij}(t)$ for
$p(t, i, \{j\})$. With this notational change (2.1) becomes (1.4).
According to the Chapman-Kolmogorov equation

$$p(t + \varepsilon, \xi, A) = \int_X p(\varepsilon, \eta, A)\, p(t, \xi, d\eta), \qquad \varepsilon > 0,$$

and, since when $\varepsilon \to 0$ the integrand converges to 0 if $\eta \notin A$, and converges
to 1 if $\eta \in A$, by (2.1), it follows that

$$\lim_{\varepsilon \downarrow 0} p(t + \varepsilon, \xi, A) = p(t, \xi, A),$$

that is, the stochastic transition functions are continuous on the right.
Further hypotheses seem to be necessary to obtain continuity on the left.
It is certainly sufficient if (2.1) is assumed to hold uniformly in $\xi$ because
then each integrand in the equality

$$p(t - \varepsilon, \xi, A) - p(t, \xi, A) = \int_A [1 - p(\varepsilon, \eta, A)]\, p(t - \varepsilon, \xi, d\eta) - \int_{X - A} p(\varepsilon, \eta, A)\, p(t - \varepsilon, \xi, d\eta), \qquad \varepsilon > 0,$$

is uniformly small with $\varepsilon$.




In the following it will be convenient to have uniquely defined
conditional probabilities of the form

$$P\{\Lambda \mid x(t_0, \omega) = \xi\}$$

in which $\Lambda$ is defined by restrictions on the sample functions for $t \ge t_0$.
This conditional probability is defined as follows: Define a Markov
process with parameter $t \ge t_0$, defining the $x(t_0)$ distribution to be
concentrated at the point $\xi$, and using the given transition probability function.
The probability assigned to $\Lambda$ in this way is uniquely defined, and the
above conditional probability will always be understood as this value.
It is clear that this definition is legitimate, and we shall use it without
further comment.

THEOREM 2.2 If $p(\cdot, \cdot, \cdot)$ is a stationary Markov transition function
satisfying the continuity condition (2.1), then the limit

$$(2.2) \qquad \lim_{t \to 0} \frac{1 - p(t, \xi, \{\xi\})}{t} = q(\xi) \le \infty$$

exists for all $\xi$. If $q(\cdot)$ is bounded on a set $A$, then the continuity condition
(2.1) holds uniformly on $A$. If (2.1) holds uniformly on $X$, $q(\cdot)$ is a bounded
function, and the limit in (2.2) is uniform in $\xi$.

If $\{x(t), 0 \le t < \infty\}$ is a separable Markov process with $p(\cdot, \cdot, \cdot)$ as its
transition function, then, if $x(t_0, \omega) = \xi$, it follows that $x(t, \omega) \equiv \xi$ in some
interval $(t_0, t_0 + \tau)$ (where $\tau$ depends on $\omega$) with probability 1.

Just as in proving Theorem 1.2, which is a special case, it will be
illuminating to prove this theorem by probabilistic methods, although the
existence of the limit $q(\xi)$ can of course be proved without explicit use of
probability concepts. Suppose then that $\{x(t), 0 \le t < \infty\}$ is a separable
process as described in the second half of the theorem. We have seen in
II §2 that there always is such a process if $X$ is a closed set on the line
closed at $\pm\infty$. It appears at first therefore that we have imposed an
extra condition here, but we shall show how to get around this point at
the end of the proof. Let $\alpha$ and $\delta$ be positive numbers, and let $m\delta$ be
the smallest multiple of $\delta$ which is $\ge \alpha$. Then (see II §2) since (2.1)
obviously implies that

$$\operatorname*{p\,lim}_{t \to t_0} x(t) = x(t_0),$$

we have (since II, Theorem 2.2, is trivially adaptable to the present case)

$$(2.3) \qquad \begin{aligned} P\{x(t, \omega) \equiv \xi,\ 0 \le t \le \alpha \mid x(0, \omega) = \xi\} &= \lim_{\delta \to 0} P\{x(j\delta, \omega) = \xi,\ j = 1, \ldots, m - 1,\ x(\alpha, \omega) = \xi \mid x(0, \omega) = \xi\} \\ &= \lim_{\delta \to 0} P\{x(j\delta, \omega) = \xi,\ j = 1, \ldots, m \mid x(0, \omega) = \xi\} \\ &= \lim_{\delta \to 0} p(\delta, \xi, \{\xi\})^m, \end{aligned}$$

the last equality holding by the Markov property. If the limit is positive, its logarithm is

$$\lim_{\delta \to 0} m \log p(\delta, \xi, \{\xi\}) = -\alpha \lim_{\delta \to 0} \frac{1 - p(\delta, \xi, \{\xi\})}{\delta},$$

so that the existence of $q(\xi)$ in (2.2) is established and

$$(2.4) \qquad P\{x(t, \omega) \equiv \xi,\ 0 \le t \le \alpha \mid x(0, \omega) = \xi\} = e^{-q(\xi)\alpha}.$$

If the limit in (2.3) is 0, then $q(\xi)$ in (2.2) is infinite and (2.4) still holds,
with the obvious convention. The evaluation (2.4) implies that

$$e^{-q(\xi)\alpha} \le p(\alpha, \xi, \{\xi\}).$$


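It follows that the continuity condition (2.1) holds uniformly in $\xi$ on any
set on which $q(\cdot)$ is bounded. If (2.1) holds uniformly in $\xi \in X$, choose
$\varepsilon > 0$ and suppose that $\alpha$ is so small that

$$p(t, \xi, \{\xi\}) \ge 1 - \varepsilon, \qquad t \le \alpha,\ \xi \in X.$$

Then, if $0 = \tau_0 < \cdots < \tau_n = \alpha$ and if ${}_\nu p(\xi, \xi)$ is defined by

$${}_\nu p(\xi, \xi) = \begin{cases} 1, & \nu = 0, \\ P\{x(\tau_j, \omega) = \xi,\ j \le \nu \mid x(0, \omega) = \xi\}, & \nu \ge 1, \end{cases}$$

it follows that

$$\begin{aligned} 1 - \varepsilon \le p(\alpha, \xi, \{\xi\}) &= {}_np(\xi, \xi) + \sum_{\nu = 0}^{n-2} \int_{X - \{\xi\}} {}_\nu p(\xi, \xi)\, p(\alpha - \tau_{\nu+1}, \eta, \{\xi\})\, p(\tau_{\nu+1} - \tau_\nu, \xi, d\eta) \\ &\le {}_np(\xi, \xi) + \varepsilon \sum_{\nu = 0}^{n-2} \int_{X - \{\xi\}} {}_\nu p(\xi, \xi)\, p(\tau_{\nu+1} - \tau_\nu, \xi, d\eta) \\ &\le {}_np(\xi, \xi) + \varepsilon[1 - {}_np(\xi, \xi)]. \end{aligned}$$

Then

$$(1 - \varepsilon)[1 - {}_np(\xi, \xi)] \le 1 - p(\alpha, \xi, \{\xi\}) \le \varepsilon$$

and, since this is true for all choices of the $\tau_j$'s,

$$(1 - \varepsilon)[1 - e^{-q(\xi)\alpha}] \le 1 - p(\alpha, \xi, \{\xi\}) \le \varepsilon.$$

If $\varepsilon < \tfrac{1}{2}$ this inequality shows that $q(\cdot)$ is a bounded function. Moreover,
the inequality

$$(1 - \varepsilon)\, \frac{1 - e^{-q(\xi)\alpha}}{\alpha} \le \frac{1 - p(\alpha, \xi, \{\xi\})}{\alpha} \le \frac{1 - e^{-q(\xi)\alpha}}{\alpha},$$

valid uniformly in $\xi$ for sufficiently small $\alpha$, now shows that the
convergence in (2.2) is uniform in $\xi$, and even that

$$\lim_{\alpha \to 0} \frac{1 - p(\alpha, \xi, \{\xi\})}{\alpha q(\xi)} = 1$$

uniformly in $\xi$ (with the obvious conventions when $q(\xi) = 0$).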

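If $q(\xi) = 0$, and $x(t_0, \omega) = \xi$, then $x(t, \omega) \equiv \xi$ for $t \ge t_0$ with
probability 1, by (2.4) (remembering that the transition probabilities are
stationary). If $0 < q(\xi) < \infty$ and if $x(t_0, \omega) = \xi$, (2.4) shows that the
distribution function of the length of the maximum interval $(t_0, t_0 + \tau)$ in
which $x(t, \omega) \equiv \xi$ is

$$P\{\tau(\omega) \le \lambda\} = \begin{cases} 1 - e^{-q(\xi)\lambda}, & \lambda \ge 0, \\ 0, & \lambda < 0. \end{cases}$$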
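The passage from (2.3) to (2.4) can be watched numerically in the simplest case. The sketch below is an illustration under assumptions that are not part of the text: it takes the two-state chain with rates $q_{12} = a$, $q_{21} = b$, for which $p(\delta, 1, \{1\}) = (b + a e^{-(a+b)\delta})/(a+b)$ and $q(1) = a$, and evaluates $p(\delta, 1, \{1\})^m$ with $m\delta = \alpha$. The function names are hypothetical.

```python
import math

def p_stay(delta, a, b):
    """p(delta, 1, {1}) for the two-state chain with q_12 = a, q_21 = b."""
    return (b + a * math.exp(-(a + b) * delta)) / (a + b)

def grid_stay_probability(alpha, m, a, b):
    """P{x(j*delta) = 1, j = 1..m | x(0) = 1} with delta = alpha/m."""
    return p_stay(alpha / m, a, b) ** m

if __name__ == "__main__":
    a, b, alpha = 2.0, 3.0, 1.0
    for m in (1, 10, 100, 10_000, 1_000_000):
        print(m, grid_stay_probability(alpha, m, a, b))
    print("exp(-q*alpha) =", math.exp(-a * alpha))
```

The grid probabilities decrease toward $e^{-q(\xi)\alpha}$ as the grid is refined, and every one of them is at least $e^{-q(\xi)\alpha}$, in agreement with the inequality $e^{-q(\xi)\alpha} \le p(\alpha, \xi, \{\xi\})$ obtained from (2.4).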


Finally we remark that the hypothesis of separability of the process was
unnecessary in proving the first part of the theorem. This is intuitively
obvious because the statement of this part of the theorem does not involve
the process sample functions. However, the proof we have given
definitely used separability, without which the conditional probability
(2.3) might not be defined. There are two simple ways of avoiding this
hypothesis. One way is to replace the left side of (2.3) by

$$P\{x(t, \omega) = \xi,\ 0 \le t \le \alpha\ (t \text{ rational}) \mid x(0, \omega) = \xi\}.$$

This conditional probability does not involve the concept of separability
of the process, and the change in (2.3) would require no change in the
proof we have given [except that of course the same change would be
made in (2.4)]. A second method if $X$ is closed (as it was in the chain
case of §1, where we used this method) is to make the process separable
by changing each $x(t)$ on at most an $\omega$ set of probability 0 to get a new
process which is separable (see II §2). This forces no change in the
transition probabilities of the process. If $X$ is not closed, this second
method makes it necessary to replace $X$ by its closure $\bar{X}$, but this causes
no difficulty if we set $p(t, \xi, A) = 0$ if $\xi \in X$, $A \subset \bar{X} - X$, and $p(t, \xi, \{\xi\}) = 1$
if $\xi \in \bar{X} - X$.

In the following we shall call any Borel subset of $X$ on which $q(\cdot)$ is
bounded a q-bounded set. Thus any finite set on which $q(\xi) < \infty$ is a
q-bounded set.

THEOREM 2.3 Let $p(\cdot, \cdot, \cdot)$ be a stationary Markov transition function
satisfying the continuity condition (2.1).

(i) If $q(\xi) = \infty$, then

$$(2.5) \qquad \lim_{t \to 0} \frac{p(t, \xi, A)}{1 - p(t, \xi, \{\xi\})} = \lim_{t \to 0} \frac{p(t, \xi_1, \{\xi\})}{1 - p(t, \xi, \{\xi\})} = 0$$

if $A$ does not contain $\xi$, and if the continuity condition (2.1) holds uniformly
on $A$ (in particular if $A$ is q-bounded), and if $\xi_1 \ne \xi$.

(ii) If $q(\xi) < \infty$, then the limits

$$(2.6) \qquad \lim_{t \to 0} \frac{p(t, \xi, A)}{t} = q(\xi, A) \le q(\xi),$$

$$(2.7) \qquad \lim_{t \to 0} \frac{p(t, \xi_1, \{\xi\})}{t} = q(\xi_1, \{\xi\}) \quad (< \infty)$$

exist, if $A$ and $\xi_1$ are restricted as in (i). If $B$ is a q-bounded set, then, for
each $\xi \in B$, the convergence in (2.6) is uniform for $A \subset B - \{\xi\}$, and $q(\xi, A)$
is completely additive in $A \subset B - \{\xi\}$.

(iii) Let

$$\hat{q}(\xi) = \underset{A}{\text{L.U.B.}}\ q(\xi, A) \qquad (A \text{ q-bounded},\ A \subset X - \{\xi\})$$

if $q(\xi) < \infty$. Then $\hat{q}(\xi) \le q(\xi)$ and if there is equality at some $\xi$ the limit
(2.6) will exist (for that $\xi$) uniformly for all Borel subsets $A$ of $X - \{\xi\}$;
and $q(\xi, A)$ is completely additive in $A$.

(iv) Suppose that $\{x(t), 0 \le t < \infty\}$ is a separable Markov process with
$p(\cdot, \cdot, \cdot)$ as its transition function. If $q(\xi) = 0$, and if $x(t_0, \omega) = \xi$, then
$x(t, \omega) \equiv \xi$ for $t \ge t_0$ with probability 1. If $0 < q(\xi) \le \infty$ and if
$x(t_0, \omega) = \xi$, there is with probability 1 a sample function discontinuity for
some $t > t_0$. Suppose that $0 < q(\xi) < \infty$ and that $A \subset X - \{\xi\}$ is q-bounded.
Then the probability is $q(\xi, A)/q(\xi)$ that, if there is a sample
function discontinuity in the finite or infinite interval $(t_0, t_0 + \alpha)$, there is a
first, which is a jump, and there are positive numbers $\tau_1 < \tau_2$ and a point
$\xi_1 \in A$, $\xi_1 \ne \xi$, all three depending on the sample function, such that
$x(t, \omega) \equiv \xi$ for $t_0 \le t < \tau_1$ and $x(t, \omega) \equiv \xi_1$ for $\tau_1 < t < \tau_2$. If $0 < \hat{q}(\xi)
= q(\xi) < \infty$, there is with probability 1 a first discontinuity after $t_0$, of the
type just described, and the preceding evaluation $q(\xi, A)/q(\xi)$ holds whenever
$A$ is a Borel subset of $X - \{\xi\}$.

Before giving the proofs of the several sections of this theorem we
derive some inequalities which will be used throughout. Let $B$ be a
q-bounded set. Choose $\varepsilon > 0$, and then $\alpha > 0$ so small that


(i— ag) <1—-e <2, FeB. 




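Let $A$ be a Borel subset of $B$ and let $\xi$, $\xi_1$ be points of $X$, with
$\xi_1 \ne \xi \in B - A$. Choose $\delta$, $0 < \delta < \alpha$, and let $m\delta$ be the smallest
multiple of $\delta$ which is $\ge \alpha$. Then

$$(2.8) \qquad \begin{aligned} p(m\delta, \xi, A) &\ge \sum_{\nu = 0}^{m-1} p(\delta, \xi, \{\xi\})^\nu \int_A p([m - \nu - 1]\delta, \eta, \{\eta\})\, p(\delta, \xi, d\eta) \\ &\ge (1 - \varepsilon) \sum_{\nu = 0}^{m-1} p(\delta, \xi, \{\xi\})^\nu\, p(\delta, \xi, A) \\ &= (1 - \varepsilon)\, p(\delta, \xi, A)\, \frac{1 - p(\delta, \xi, \{\xi\})^m}{1 - p(\delta, \xi, \{\xi\})}. \end{aligned}$$

Note that this inequality is also true for sufficiently small $\alpha$ if $A$ is any
Borel set, q-bounded or not, on which the continuity condition (2.1) holds
uniformly. Moreover,

$$(2.9) \qquad \begin{aligned} p(m\delta, \xi_1, \{\xi\}) &\ge \sum_{\nu = 0}^{m-1} p([m - \nu - 1]\delta, \xi_1, \{\xi_1\})\, p(\delta, \xi_1, \{\xi\})\, p(\delta, \xi, \{\xi\})^\nu \\ &\ge (1 - \varepsilon)\, p(\delta, \xi_1, \{\xi\}) \sum_{\nu = 0}^{m-1} p(\delta, \xi, \{\xi\})^\nu \\ &= (1 - \varepsilon)\, p(\delta, \xi_1, \{\xi\})\, \frac{1 - p(\delta, \xi, \{\xi\})^m}{1 - p(\delta, \xi, \{\xi\})}. \end{aligned}$$

Proof of (i) If $q(\xi) = \infty$, then when $\delta \to 0$ in (2.8) and (2.9) we obtain

$$p(\alpha, \xi, A) \ge (1 - \varepsilon) \limsup_{\delta \to 0} \frac{p(\delta, \xi, A)}{1 - p(\delta, \xi, \{\xi\})},$$

$$p(\alpha, \xi_1, \{\xi\}) \ge (1 - \varepsilon) \limsup_{\delta \to 0} \frac{p(\delta, \xi_1, \{\xi\})}{1 - p(\delta, \xi, \{\xi\})},$$

and if $\varepsilon < 1$ these inequalities imply (2.5) when $\alpha \to 0$.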
Proof of (ii) The case $q(\xi) = 0$ is solved by the evaluation (2.4). If
$0 < q(\xi) < \infty$, (2.8) and (2.9) become, when $\delta \to 0$,

$$p(\alpha, \xi, A) \ge (1 - \varepsilon)\, \frac{1 - e^{-q(\xi)\alpha}}{q(\xi)} \limsup_{\delta \to 0} \frac{p(\delta, \xi, A)}{\delta},$$

$$p(\alpha, \xi_1, \{\xi\}) \ge (1 - \varepsilon)\, \frac{1 - e^{-q(\xi)\alpha}}{q(\xi)} \limsup_{\delta \to 0} \frac{p(\delta, \xi_1, \{\xi\})}{\delta}.$$

These inequalities imply the finiteness of the superior limits on the right.
Moreover, the first inequality implies that

$$\liminf_{\alpha \to 0} \frac{p(\alpha, \xi, A)}{\alpha} \ge (1 - \varepsilon) \limsup_{\delta \to 0} \frac{p(\delta, \xi, A)}{\delta},$$

which implies the existence of the limit in (2.6), and in the same way the
second inequality implies the existence of the limit in (2.7). The first
inequality now becomes

$$p(\alpha, \xi, A) \ge (1 - \varepsilon)\, \frac{1 - e^{-q(\xi)\alpha}}{q(\xi)}\, q(\xi, A) \ge \alpha q(\xi, A) - 2\alpha\varepsilon q(\xi).$$

If we apply this inequality to $B - A - \{\xi\}$ and combine the two inequalities,
we obtain

$$-2\varepsilon q(\xi) \le \frac{p(\alpha, \xi, A)}{\alpha} - q(\xi, A) \le \frac{p(\alpha, \xi, B - \{\xi\})}{\alpha} - q(\xi, B - \{\xi\}) + 2\varepsilon q(\xi).$$

Then the limit in (2.6) is uniform in $A \subset B - \{\xi\}$ for fixed $\xi$. The set
function $q(\xi, \cdot)$ is obviously additive. The uniformity of convergence we
have just proved shows that, for each $\xi$, $q(\xi, \cdot)$ is completely additive for
sets $A \subset B - \{\xi\}$.

Proof of (iii) Let $\hat{q}(\xi) = q(\xi)$; $\xi$ is fixed throughout the following
discussion. We shall use repeatedly the fact [see (ii)] that $q(\xi, \cdot)$ is an
additive function of q-bounded subsets of $X - \{\xi\}$, and is completely
additive on the subsets of a fixed q-bounded set. By definition of $\hat{q}(\xi)$
there is a sequence $C_1, C_2, \ldots$ of q-bounded subsets of $X - \{\xi\}$ such that

$$\lim_{n \to \infty} q(\xi, C_n) = \hat{q}(\xi) = q(\xi);$$

and we can suppose that $C_1 \subset C_2 \subset \cdots$, replacing $C_n$ by $\bigcup_{j \le n} C_j$ if necessary,
to achieve this. Let $C = \bigcup_n C_n$ and let $C' = X - C - \{\xi\}$. Then, if $A$
is q-bounded and if $\xi \notin A$,

$$q(\xi, A) = q(\xi, AC) + q(\xi, AC').$$

Now, if $q(\xi, AC') > 0$, it follows that

$$q(\xi, C_n + AC') = q(\xi, C_n) + q(\xi, AC') > \hat{q}(\xi)$$

for large $n$, which is impossible. Hence $q(\xi, AC') = 0$ and it follows that
we can write $q(\xi, A)$ in the form

$$q(\xi, A) = \lim_{n \to \infty} q(\xi, AC_n),$$

if $A \subset X - \{\xi\}$ and if $A$ is q-bounded. We therefore can define $q(\xi, A)$
for every Borel subset $A$ of $X - \{\xi\}$ as this limit, without any conflict with
the previous definition. We now show that, with this definition, (2.6)
holds uniformly in the sets $A$ considered, and this will imply that $q(\xi, \cdot)$
(which is obviously additive) is completely additive. Let $\varepsilon$ be a positive
number and choose $n$ so large that

$$\hat{q}(\xi) = q(\xi) < q(\xi, C_n) + \varepsilon/5.$$

With this choice of $n$ choose $\delta > 0$ so small that

$$\left| \frac{1 - p(t, \xi, \{\xi\})}{t} - q(\xi) \right| < \frac{\varepsilon}{5}, \qquad \left| \frac{p(t, \xi, AC_n)}{t} - q(\xi, AC_n) \right| < \frac{\varepsilon}{5}, \qquad 0 < t < \delta,$$

for all Borel subsets $A$ of $X - \{\xi\}$. The second inequality is feasible by
(ii) because $AC_n$ is a subset of the fixed q-bounded set $C_n$. Then, if
$0 < t < \delta$,

$$\begin{aligned} \left| \frac{p(t, \xi, A)}{t} - q(\xi, A) \right| &\le \left| \frac{p(t, \xi, AC_n)}{t} - q(\xi, AC_n) \right| + \left| \frac{1 - p(t, \xi, \{\xi\})}{t} - q(\xi) \right| \\ &\quad + \left| \frac{p(t, \xi, C_n)}{t} - q(\xi, C_n) \right| + 2[q(\xi) - q(\xi, C_n)] \\ &< \varepsilon, \end{aligned}$$

proving the stated uniform convergence.

Proof of (iv) The first two statements of (iv) follow from (2.4). In
proving the remaining statements take $t_0 = 0$ as usual. In the following
let $\xi$ be fixed, with $q(\xi) < \infty$, and let $A$ be a Borel subset of $X - \{\xi\}$. We
suppose that $A$ is q-bounded if $\hat{q}(\xi) < q(\xi)$. Choose $\alpha > 0$, $\beta > 0$ and
let $\Lambda_{n, \alpha, \beta}$ be the $\omega$ set for which, for some $\nu$, $2 \le \nu \le n$,


= 

x(t, w) = £, (= 
n 

Syed) ee ee ee. 
n n 


Then, just as in §1, 
alee 
P(A 54) |20, 0) =H = S [e ae E, E an) 
y=2 n 
A 


— e Moe 
> [=] fora qé, dn) (n = œ) 
A 


Ss [ i eaen] oF (B + 0), 




and the stated results all follow from this limiting equation. (See the 
corresponding discussion in the proof of Theorem 1.3.) 

Example 1 Markov chains with infinitely many states As an example
of the application of these theorems let $X$ be the set of positive integers,
with $p(t, i, \{j\}) = p_{ij}(t)$ as usual. The continuity condition (2.1) is then

$$(2.10) \qquad \lim_{t \to 0} p_{ii}(t) = 1, \qquad i = 1, 2, \ldots.$$

According to Theorem 2.2

$$\lim_{t \to 0} \frac{1 - p_{ii}(t)}{t} = q_i \le \infty$$

exists for all $i$. According to Theorem 2.3, if $q_i = \infty$,

$$\lim_{t \to 0} \frac{p_{ij}(t)}{1 - p_{ii}(t)} = \lim_{t \to 0} \frac{p_{ji}(t)}{1 - p_{ii}(t)} = 0, \qquad j \ne i.$$

On the other hand, if $q_i < \infty$,

$$\lim_{t \to 0} \frac{p_{ij}(t)}{t} = q_{ij}, \qquad j \ne i,$$

and

$$\lim_{t \to 0} \frac{p_{ji}(t)}{t}$$

exist as finite limits with $\sum_j q_{ij} \le q_i$. The quantity $\hat{q}(\xi)$ becomes $\sum_j q_{ij}$
(identifying $\xi$ with $i$). According to Theorem 2.3 (iv), if $\sum_j q_{ij} = q_i$, then
Theorem 1.2 is applicable to the infinite dimensional chain. Finally if
(2.10) is true uniformly in $i$, then every $q_i$ is finite and the sequence $\{q_i\}$ is
bounded, according to Theorem 2.2, so that every $X$ set is q-bounded.
Then $q_i = \sum_j q_{ij}$ for all $i$. The next theorem shows that in this very
special case the sample functions are continuous except for jumps (if the
$x(t)$ process is separable), that is, Theorem 1.3 generalizes in this case.
However, examples will be given of chains with $\infty > q_i = \sum_j q_{ij}$ for all $i$
and with almost all sample functions having worse discontinuities than
jumps. According to Theorem 2.4, L.U.B. $q_i = \infty$ in any such example.


It will be useful below to define at this point the statement "$[q(\cdot), q(\cdot, \cdot)]$
is a standard pair of q-functions." This statement is to mean the
following:

(a) There is a linear Borel set $X$; $q(\cdot)$ is defined on $X$ and $q(\cdot, \cdot)$ is
defined for $\xi \in X$ and $A$ a Borel subset of $X - \{\xi\}$. (A Borel subset of $X$
on which $q(\cdot)$ is bounded will be called q-bounded.)

(b) $q(\cdot)$ is a Baire function; $q(\cdot, A)$ is a Baire function for fixed $A$, and
completely additive in $A$ for fixed $\xi$. Both functions are finite-valued and
non-negative.

(c) $q(\xi) = q(\xi, X - \{\xi\})$. Note that, since $q$ is finite-valued, $X$ is the
union of a sequence of q-bounded sets, so that $q(\xi) = \text{L.U.B. } q(\xi, A)$, for
q-bounded $A \subset X - \{\xi\}$.

THEOREM 2.4 Let $p(\cdot, \cdot, \cdot)$ be a stationary Markov transition function
satisfying the continuity condition (2.1) uniformly in $\xi$. Then the pair
$q(\cdot)$, $q(\cdot, \cdot)$ defined by Theorems 2.2 and 2.3 is a standard pair of q-functions
and $q(\cdot)$ is bounded. If $\{x(t), 0 \le t < \infty\}$ is a separable Markov process
with $p(\cdot, \cdot, \cdot)$ as transition function, almost all the sample functions are step
functions.

Conversely, if $[q(\cdot), q(\cdot, \cdot)]$ is a standard pair of q-functions, and if $q(\cdot)$
is bounded, there is a unique corresponding stationary Markov transition
function satisfying the continuity condition (2.1), and determining $q(\cdot)$, $q(\cdot, \cdot)$
in accordance with Theorems 2.2 and 2.3. The continuity condition (2.1)
will then be satisfied uniformly in $\xi$.

If a stationary Markov transition function satisfies the continuity
condition (2.1) uniformly in $\xi$, we have seen in Theorem 2.2 that the
function $q(\cdot)$ defined in that theorem is bounded. Then $q(\cdot)$ and the
$q(\cdot, \cdot)$ defined in Theorem 2.3 constitute a standard pair of q-functions.
The proof of the remainder of the theorem follows that of Theorem 1.4,
and the subsequent discussion, but will be sketched because the ideas will
be generalized below. Let $\{x(t), 0 \le t < \infty\}$ be a separable process
determined by the given $p(\cdot, \cdot, \cdot)$ together with an initial probability
distribution. If $x(0, \omega) = z_1(\omega)$, and if $q[z_1(\omega)] = 0$, then $x(t, \omega) \equiv z_1(\omega)$,
whereas if $q[z_1(\omega)] > 0$ we have seen that there is a $\tau_1(\omega)$, the first
discontinuity of the sample function determined by $\omega$, and a $z_2(\omega) \ne z_1(\omega)$,
such that $x(t, \omega) = z_1(\omega)$ for $0 \le t < \tau_1(\omega)$, and $x[\tau_1(\omega)+, \omega] = z_2(\omega)$.
If $q[z_1(\omega)] = 0$, set $\tau_1(\omega) = \tau_2(\omega) = \cdots = \infty$. If $q[z_2(\omega)] = 0$, set
$\tau_2(\omega) = \tau_3(\omega) = \cdots = \infty$; if $q[z_2(\omega)] > 0$, there is a $\tau_2(\omega)$ and a
$z_3(\omega) \ne z_2(\omega)$ such that $x(t, \omega) = z_2(\omega)$ for $\tau_1(\omega) < t < \tau_2(\omega)$,
$x[\tau_2(\omega)+, \omega] = z_3(\omega)$, and so on. Here

$$P\{\tau_{n+1}(\omega) - \tau_n(\omega) > \alpha \mid z_1, \ldots, z_{n+1}, \tau_1, \ldots, \tau_n\} = e^{-q(z_{n+1})\alpha}, \qquad \alpha > 0,$$

and

$$P\{z_{n+1}(\omega) \in A \mid z_1, \ldots, z_n, \tau_1, \ldots, \tau_n\} = \frac{q(z_n, A)}{q(z_n)},$$

with probability 1, with the obvious modifications to take care of the zeros
of $q(\cdot)$. This argument would be correct even if we had supposed only
that $[q(\cdot), q(\cdot, \cdot)]$ was a standard pair of q-functions, and gives (almost all)
the sample functions of the process for $t < \lim_{n \to \infty} \tau_n(\omega)$, except at the jump
points. The additional information we have, that $q(\cdot)$ is bounded, is now
used just as in the proof of Theorem 1.4 to prove that $\lim_{n \to \infty} \tau_n = \infty$ with
probability 1, and it follows that almost all sample functions of the process
are step functions. Conversely, if $[q(\cdot), q(\cdot, \cdot)]$ is any standard pair of
q-functions, and if $z_1$ is any random variable with values in $X$, then $z_1, \tau_1, z_2, \tau_2,
\ldots$ are defined to have the distributions written above and $x(t)$ is defined
by

$$x(t, \omega) = \begin{cases} z_1(\omega), & 0 \le t < \tau_1(\omega), \\ z_2(\omega), & \tau_1(\omega) \le t < \tau_2(\omega), \\ \cdots \end{cases}$$

This definition is effective for $0 \le t < \lim_{n \to \infty} \tau_n(\omega)$ and again, if $q(\cdot)$ is
bounded, as we suppose in this theorem, $\lim_{n \to \infty} \tau_n = \infty$ with probability 1,
so that $x(t)$ is defined for all $t$, with probability 1. The same simple
argument used in §1 in the chain case shows that the $x(t)$ process is one
with the desired stationary Markov transition function.

THEOREM 2.5 Suppose that $[q(\cdot), q(\cdot, \cdot)]$ is a standard pair of q-functions.
There is then at least one corresponding stationary Markov transition
function satisfying the continuity condition (2.1) and related to $q(\cdot)$, $q(\cdot, \cdot)$
by Theorems 2.2 and 2.3. There is either only one such transition function,
and in that case the separable Markov processes with this transition function
have sample functions which are almost all step functions, or there are
infinitely many such transition functions, and in that case to every transition
function corresponds some separable process whose sample functions are
step functions with probability $< 1$.

To prove this theorem suppose that $[q(\cdot), q(\cdot, \cdot)]$ is a standard pair of
q-functions and define $x(t)$ as in the proof of the converse half of Theorem
2.4 for $t < \lim_{n \to \infty} \tau_n(\omega) = \tau_\infty(\omega)$. We must complete the definition, in case
$\tau_\infty$ is finite with positive probability. In this case one way (but not
necessarily the only way) to complete the definition is the following. Let
$\pi(\cdot)$ be any probability distribution of Borel subsets of $X$, and choose
$z_{\infty+1}$ a random variable independent of the $z_j$'s and $\tau_j$'s, with the $\pi(\cdot)$
distribution. Choose $\tau_{\infty+1} > \tau_\infty$ with the distribution determined by

$$P\{\tau_{\infty+1}(\omega) - \tau_\infty(\omega) > \alpha \mid z_1, \tau_1, z_2, \tau_2, \ldots, z_{\infty+1}\} = e^{-q(z_{\infty+1})\alpha}, \qquad \alpha \ge 0,$$

and define $x(t, \omega) = z_{\infty+1}(\omega)$ for $\tau_\infty(\omega) \le t < \tau_{\infty+1}(\omega)$. We then continue
as before, letting $x(t, \omega)$ go through transitions determined by the
$q(\cdot, \cdot)/q(\cdot)$ distributions, and determining how to go on at any point like
$\tau_\infty(\omega)$ which is a limit point of jumps by starting off afresh using the $\pi(\cdot)$
distribution. An elementary ordinal number argument shows that, for
any $t$, $x(t)$ is then defined with probability 1. The $x(t)$ process defined in
this way is a Markov process with a stationary Markov transition function
satisfying the prescribed conditions. In fact, the argument used in the
case of §1 is applicable even to this general case. If $\lim_{n \to \infty} \tau_n = \infty$ with
probability 1 for every distribution of $x(0)$, only one Markov process can
be obtained in this way, aside from the choice of initial distribution, and
therefore only one stationary Markov transition function can be obtained.
In this case almost all sample functions of any separable process with this
transition function are step functions. On the other hand, if for some
choice of the initial distribution

$$P\{\lim_{n \to \infty} \tau_n(\omega) = \infty\} = p < 1,$$

the process finally obtained will depend on the choice of $\pi(\cdot)$, and for each
choice the probability that the sample functions are step functions is
$p < 1$. This finishes the proof of the theorem.
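The case $\lim \tau_n < \infty$ can be made concrete with a simple simulation. The example below is an illustration chosen for this sketch and is not one of the examples of the text: it assumes the chain on the positive integers that moves from $n$ to $n + 1$ at rate $q_n = n^2$, so that the holding times are independent exponentials with means $1/n^2$ and the expected value of $\lim \tau_n$ is $\sum 1/n^2 = \pi^2/6$. The function names are hypothetical.

```python
import math
import random

def time_of_first_jumps(rng, n_jumps):
    """tau_{n_jumps} for the chain n -> n+1 with rate q_n = n*n, started at 1."""
    return sum(rng.expovariate(n * n) for n in range(1, n_jumps + 1))

def expected_time_of_first_jumps(n_jumps):
    """E[tau_{n_jumps}] = sum of the mean holding times 1/n^2."""
    return sum(1.0 / (n * n) for n in range(1, n_jumps + 1))

if __name__ == "__main__":
    rng = random.Random(1)
    samples = [time_of_first_jumps(rng, 500) for _ in range(500)]
    print("sample mean of tau_500:", sum(samples) / len(samples))
    print("E[tau_500]            :", expected_time_of_first_jumps(500))
    print("pi^2 / 6              :", math.pi ** 2 / 6)
```

Since the expected jump times stay bounded, infinitely many jumps accumulate in a finite time with probability 1 in this assumed example, which is the situation in which the definition of $x(t)$ has to be completed beyond $\lim \tau_n$.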

Suppose now that $p(\cdot, \cdot, \cdot)$ is a stationary Markov transition function
satisfying (2.1) whose corresponding pair $[q(\cdot), q(\cdot, \cdot)]$ is a standard pair.
Then just as in §1 we can compute the probability ${}_np(t, \xi, A)$ that (for a
separable process), if $x(t_0, \omega) = \xi$, then $x(t_0 + t, \omega) \in A$ (where $A$ is any
Borel subset of $X$) and that the transition has been effected in $n$ steps, that
is, that the sample function has exactly $n$ discontinuities in the interval
$(t_0, t_0 + t)$, and that in each of the $n + 1$ open intervals determined by
these discontinuities the sample function is identically constant. Using
the same argument as in §1, we find

$$(2.11) \qquad \begin{aligned} {}_0p(t, \xi, A) &= \begin{cases} 0, & \xi \notin A, \\ e^{-q(\xi)t}, & \xi \in A, \end{cases} \\ {}_{n+1}p(t, \xi, A) &= \int_0^t ds \int_{X - \{\xi\}} e^{-q(\xi)s}\, {}_np(t - s, \eta, A)\, q(\xi, d\eta), \end{aligned}$$

or alternatively

$$(2.11') \qquad \begin{aligned} {}_0p(t, \xi, A) &= \begin{cases} 0, & \xi \notin A, \\ e^{-q(\xi)t}, & \xi \in A, \end{cases} \\ {}_{n+1}p(t, \xi, A) &= \int_0^t ds \int_X {}_np(s, \xi, d\eta) \int_{A - A\{\eta\}} e^{-q(\zeta)(t - s)}\, q(\eta, d\zeta). \end{aligned}$$

Then $\bar{p}(t, \xi, A)$, defined by

$$(2.12) \qquad \bar{p}(t, \xi, A) = \sum_{n=0}^{\infty} {}_np(t, \xi, A),$$

is the probability of a transition from $\xi$ into a point of $A$ in time $t$, the
transition having been effected in finitely many steps. In particular, if it
is known that the sample functions are almost all step functions, as is the
case according to Theorem 2.4 if (2.1) holds uniformly in $\xi$, then
$\bar{p}(t, \xi, A) = p(t, \xi, A)$ for all $t$, $\xi$, and $A$. Even without this, however, it
is clear that $\bar{p}(t, \xi, A) \le 1$; according to Theorem 2.5 there are cases in
which actually $\bar{p}(t, \xi, X) < 1$. Evidently $\bar{p}(t, \xi, X) = 1$ for all $t$ and $\xi$ if
and only if $\bar{p}(t, \xi, A) = p(t, \xi, A)$ for all $t$, $\xi$, and $A$.

Conversely, if $[q(\cdot), q(\cdot,\cdot)]$ is any standard pair of q-functions, (2.11) and
(2.12) define a non-negative function $\bar p(\cdot,\cdot,\cdot)$ which is a Baire function of
$\xi$ for fixed $t$, $A$, completely additive in $A$ for fixed $t$, $\xi$, which satisfies the
Chapman-Kolmogorov equation, and for which $\bar p(t, \xi, X) \le 1$. The
analytic procedure of (2.11) and (2.12) can be used to define a stationary
Markov transition function in terms of a given pair of standard q-functions
if and only if the q-functions match a separable process whose sample
functions are almost all step functions. The procedure is therefore
applicable, for example, if $q(\cdot)$ is bounded, and in this case the procedure
is simply one analytic form of the probability proof of the converse half
of Theorem 2.4. According to Theorem 2.5, $\bar p(\cdot,\cdot,\cdot)$ can always be
increased, if necessary, to a Markov transition function, but if an increase
is actually necessary it can be done in infinitely many different ways,
leading to different Markov transition functions.
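
The recursion (2.11) followed by the sum (2.12) can be tried numerically in the simplest setting, a chain with finitely many states, where the integral over $X - \{\xi\}$ becomes a finite sum. The following is only a sketch under that assumption: the function name `minimal_transition`, the grid size, and the use of the trapezoid rule are illustrative choices, not part of the text.

```python
import math

def minimal_transition(q, t_max, steps, n_terms):
    """Approximate p_bar[i][k](t_max) for a finite chain (illustrative sketch).

    q[i][j] (i != j) are the jump intensities; q_i is the row sum.
    Discrete-state form of (2.11)-(2.12):
      _0 p_ik(t)     = delta_ik * exp(-q_i t)
      _{n+1} p_ik(t) = int_0^t exp(-q_i s) sum_{j != i} q_ij  _n p_jk(t - s) ds
    with the integral replaced by the trapezoid rule on a uniform grid.
    """
    size = len(q)
    rate = [sum(q[i][j] for j in range(size) if j != i) for i in range(size)]
    h = t_max / steps
    decay = [[math.exp(-rate[i] * m * h) for m in range(steps + 1)]
             for i in range(size)]
    # current[i][k][m] holds _n p_ik(m * h); start with n = 0
    current = [[[decay[i][m] if i == k else 0.0 for m in range(steps + 1)]
                for k in range(size)] for i in range(size)]
    total = [[list(current[i][k]) for k in range(size)] for i in range(size)]
    for _step in range(n_terms):
        nxt = [[[0.0] * (steps + 1) for _k in range(size)] for _i in range(size)]
        for i in range(size):
            for k in range(size):
                for m in range(1, steps + 1):
                    acc = 0.0
                    for r in range(m + 1):
                        inner = sum(q[i][j] * current[j][k][m - r]
                                    for j in range(size) if j != i)
                        weight = 0.5 if r in (0, m) else 1.0
                        acc += weight * decay[i][r] * inner
                    nxt[i][k][m] = acc * h
        current = nxt
        for i in range(size):
            for k in range(size):
                for m in range(steps + 1):
                    total[i][k][m] += current[i][k][m]   # partial sum of (2.12)
    return [[total[i][k][steps] for k in range(size)] for i in range(size)]
```

For a two-state chain with intensities 1 and 2 the sample functions are step functions, so the sum should reproduce the ordinary transition probability, whose first diagonal entry at $t = 1$ is $2/3 + e^{-3}/3$ in closed form; `minimal_transition([[0.0, 1.0], [2.0, 0.0]], 1.0, 100, 15)` can be compared with that value.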

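Suppose now that $p(\cdot,\cdot,\cdot)$ is a stationary Markov transition function
satisfying the continuity condition (2.1) whose corresponding pair
$[q(\cdot), q(\cdot,\cdot)]$ is a standard pair of q-functions. Then

(2.13)   $$p(t, \xi, A) = \int_0^t ds \int_{X-\{\xi\}} e^{-q(\xi)s}\, p(t-s, \eta, A)\, q(\xi, d\eta) + e^{-q(\xi)t}\,\delta(\xi, A),$$

where

$$\delta(\xi, A) = \begin{cases} 1, & \xi \in A \\ 0, & \xi \notin A. \end{cases}$$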

This equation is a generalization of (1.16) and is derived in the same way,
by considering a separable process with the given transition probabilities,
and observing that the transition from $\xi$ to a point of $A$ can be accomplished
either by simply remaining at $\xi$ (if $\xi \in A$) or by remaining at $\xi$
through $s$ time units, then jumping to $\eta$, from which there is then a
transition to a point of $A$ in the remaining time. The rigorous justification
of this sort of derivation of equations like (2.13) is the same as that


270 MARKOV PROCESSES—CONTINUOUS PARAMETER vi 


of similar derivations discussed in §1. This equation proves that
$p(\cdot, \xi, A)$ is a continuous function of $t$ with a continuous derivative given by

(2.14)   $$\frac{\partial p(t, \xi, A)}{\partial t} = -q(\xi)\,p(t, \xi, A) + \int_{X-\{\xi\}} p(t, \eta, A)\, q(\xi, d\eta).$$

This equation is a generalization of (1.7). The natural complement of
(2.13) is the equation for $p(\cdot,\cdot,\cdot)$ obtained by considering the transition
from $\xi$ to $A$ to be accomplished either by simply remaining at $\xi$ (if $\xi \in A$)
for a time interval of length $t$ or by going to $\eta$ at time $s$, the time of the
last jump, jumping to a point of $A$ and remaining at that point through
some interval of constancy. Since our hypotheses on $q(\cdot)$, $q(\cdot,\cdot)$ do not
insure the existence of such an interval of constancy preceded by a last
jump, we have described only one way of going from $\xi$ into $A$, that is,

(2.13')  $$p(t, \xi, A) \ge \int_0^t ds \int_X p(s, \xi, d\eta) \int_{A-A\{\eta\}} e^{-q(\zeta)(t-s)}\, q(\eta, d\zeta) + e^{-q(\xi)t}\,\delta(\xi, A);$$
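and in fact the same type of reasoning gives the slightly more general

$$p(t_2, \xi, A) - \int_A e^{-q(\eta)(t_2-t_1)}\, p(t_1, \xi, d\eta) \ge \int_{t_1}^{t_2} ds \int_X p(s, \xi, d\eta) \int_{A-A\{\eta\}} e^{-q(\zeta)(t_2-s)}\, q(\eta, d\zeta), \qquad t_1 < t_2.$$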

If we divide both sides of this inequality by $t_2 - t_1$ and take the limit when
$t_2 \to t$, $t_1 \to t$, we find that, if $A$ is q-bounded,

(2.14')  $$\frac{\partial p(t, \xi, A)}{\partial t} \ge -\int_A q(\eta, X - A)\, p(t, \xi, d\eta) + \int_{X-A} q(\eta, A)\, p(t, \xi, d\eta).$$

This is a generalization of (1.7'). There is equality here for all $\xi$ and
q-bounded $A$ if the stochastic transition function $p(\cdot,\cdot,\cdot)$ corresponds to
separable processes whose sample functions are almost all step functions.
It can be shown that this condition is not necessary, however.

It is interesting to observe that it follows at once from the probability
significance of $\bar p(\cdot,\cdot,\cdot)$ that this function, which, as we have already
remarked, satisfies all the conditions of a stationary Markov transition
function except that $\bar p(t, \xi, X)$ is not necessarily 1, and satisfies the
inequality

$$p(t, \xi, A) \ge \bar p(t, \xi, A),$$

also satisfies the backward system of integro-differential equations, and
the forward system (with equality). The fact that $\bar p(\cdot,\cdot,\cdot)$ satisfies these




systems of equations also follows from the fact that (2.11) and (2.11')
summed over $n$ yield (2.13) and (2.13') (with equality) for $\bar p$. The latter
pair of equations yields the backward and forward systems of integro-differential
equations, on differentiation. Thus the backward system has
infinitely many solutions with the same initial conditions $p(0, \xi, A) = \delta(\xi, A)$
unless $\bar p(t, \xi, A) = p(t, \xi, A)$, and these solutions include some,
like $\bar p$, which are not stationary Markov transition functions.

Example 1  Markov chains with infinitely many states (continued)  According
to our results, if $\{q_i, i \ge 1\}$ and $\{q_{ij}, i, j \ge 1, i \ne j\}$ are arbitrary
sequences of non-negative numbers, with $q_i = \sum_{j \ne i} q_{ij}$ for all $i$ (this is the
condition for a standard pair of q-functions in the present case), there is
a corresponding stationary Markov transition matrix function $[p_{ij}(\cdot)]$
which satisfies the backward system of differential equations (1.7) (without
the restriction $i, j \le N$ of course). The forward system (1.7') must be
replaced by a system of inequalities obtained by replacing "=" in the
system (1.7') by "$\ge$." On the other hand, the matrix function $[\bar p_{ij}(\cdot)]$,
where $\bar p_{ij}(t)$ is the probability of a transition from $i$ to $j$ in time $t$ in finitely
many steps, satisfies both backward and forward systems, with equality
in each, and with the same initial conditions

$$p_{ij}(0) = \bar p_{ij}(0) = \delta_{ij}.$$

Moreover,

$$p_{ij}(t) \ge \bar p_{ij}(t).$$


The following very simple case illustrates these results. Suppose that

$$q_{i,i+1} = q_i, \qquad q_{ij} = 0, \quad j \ne i,\ i+1.$$

Set $x(0) = 1$ with probability 1, and construct the process corresponding
to these $q_i$'s as in Theorem 2.5. In this case the differences $\tau_2 - \tau_1$,
$\tau_3 - \tau_2, \cdots$ are mutually independent, and $\tau_{n+1} - \tau_n$ is a positive random
variable with density function $q_n e^{-q_n\lambda}$, $\lambda > 0$. Then $E\{\tau_{n+1} - \tau_n\} = 1/q_n$,
so that, if no $q_n$ vanishes and if $\sum_n 1/q_n < \infty$, it follows that $\lim_{n\to\infty} \tau_n < \infty$
with probability 1. (A more careful examination of the partial sums of
the series, say by means of characteristic functions, shows at once that the
sufficient condition of convergence we have obtained here is also necessary,
but we shall not use this fact.) Suppose that the $q_n$'s are chosen so that
this series converges. One simple way to define the $\pi(\cdot)$ distribution of
Theorem 2.5 in the present case, that is, to define the distribution of
$x(\tau_\infty+)$, is to set $x(\tau_\infty+) = 1$ with probability 1, so that after any limit




point of jumps the sample functions return to the value 1, but of course
there are infinitely many $\pi(\cdot)$ distributions, giving different types of
sample functions and different transition probabilities to the resulting
process. The backward system of differential equations is

$$p_{ik}'(t) = -q_i\, p_{ik}(t) + q_i\, p_{i+1,k}(t), \qquad i, k \ge 1,$$

and the forward system of differential inequalities is

$$p_{i1}'(t) \ge -p_{i1}(t)\, q_1,$$
$$p_{ik}'(t) \ge -p_{ik}(t)\, q_k + p_{i,k-1}(t)\, q_{k-1}, \qquad k > 1, \quad i \ge 1.$$


According to the forward system applied to $\bar p_{21}(t)$, in which case there is
equality, $\bar p_{21}(t) \equiv 0$, and it is of course true that no transition from state 2
to state 1 is possible in a finite number of jumps as we have defined the
$q_{ij}$'s. However, if $\sum_n 1/q_n < \infty$, such a transition is possible in infinitely
many jumps as we chose the $\pi(\cdot)$ distribution above. Other choices exist
(for example concentrating this distribution at the value 2 instead of 1)
for which such a transition is impossible even in infinitely many jumps.
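
The convergence condition $\sum_n 1/q_n < \infty$ of this example can be illustrated numerically. The sketch below assumes the hypothetical choice $q_n = n^2$, for which the expected value of $\lim \tau_n$ is $\sum_n 1/n^2 = \pi^2/6$; the function names and the truncation at finitely many jumps are illustrative assumptions, not part of the text.

```python
import math
import random

def expected_explosion_time(q, n_terms):
    """Partial sum of sum_n 1/q(n): the mean of the limit of the jump times."""
    return sum(1.0 / q(n) for n in range(1, n_terms + 1))

def sample_explosion_time(q, n_terms, rng):
    """One draw of the time of the n_terms-th jump.

    The holding times are independent exponential variables with
    rates q(1), q(2), ..., as in the example (truncated after n_terms jumps).
    """
    return sum(rng.expovariate(q(n)) for n in range(1, n_terms + 1))

def rates(n):
    # Hypothetical intensities q_n = n^2, chosen so that the series converges.
    return float(n * n)

mean_exact = expected_explosion_time(rates, 10000)
rng = random.Random(0)
draws = [sample_explosion_time(rates, 200, rng) for _ in range(2000)]
mean_simulated = sum(draws) / len(draws)
print(mean_exact, math.pi ** 2 / 6.0, mean_simulated)
```

With intensities such as $q_n = n$, for which the series diverges, the partial sums returned by `expected_explosion_time` grow without bound as `n_terms` increases, in agreement with the necessity remark in the text.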

We have investigated certain types of Markov transition functions in
this section, those characterized by the continuity condition (2.1), by
methods which reveal the probability significance of each step of the
reasoning. The results not explicitly involving sample functions can of
course be obtained without reference to the theory of probability. For
example, let $p(\cdot,\cdot,\cdot)$ be any stochastic transition function satisfying (2.1)
and suppose that the limits $q(\xi)$, $q(\xi, A)$ in (2.2) and (2.6) exist and
determine a standard pair of q-functions. Then $p(\cdot,\cdot,\cdot)$ must have the
form

$$p(t, \xi, A) = [1 - t\,q(\xi)]\,\delta(\xi, A) + t\,q(\xi, A - \{\xi\}) + o(t),$$

for each $\xi$ and $A$. Equation (2.14) is then easily derived by an elementary
manipulation of the Chapman-Kolmogorov equation, and then the study
of (2.14) and (2.14') becomes the study of the solutions of these equations
under the initial conditions $p(0, \xi, A) = \delta(\xi, A)$.

If $p(\cdot,\cdot;\cdot,\cdot)$ is a Markov transition function (non-stationary case) we
sketch a derivation of the generalizations of (2.14) and (2.14'). Suppose
that

$$p(s, \xi; t, A) = \begin{cases} 1 - q(t, \xi)(t - s) + o(t - s), & \xi \in A \\ q(t, \xi, A)(t - s) + o(t - s), & \xi \notin A \end{cases}$$

where

$$q(t, \xi) \ge 0, \qquad q(t, \xi, A) \ge 0, \qquad q(t, \xi, X - \{\xi\}) = q(t, \xi)$$
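and $q(t, \xi, \cdot)$ is completely additive in $A \subset X - \{\xi\}$. Then

$$\frac{\partial p(s, \xi; t, A)}{\partial s}\Big|_{s=t} = \begin{cases} q(t, \xi), & \xi \in A \\ -q(t, \xi, A), & \xi \notin A \end{cases}$$

$$\frac{\partial p(s, \xi; t, A)}{\partial t}\Big|_{t=s} = \begin{cases} -q(s, \xi), & \xi \in A \\ q(s, \xi, A), & \xi \notin A, \end{cases}$$

and under conditions we shall not discuss here (2.14) and (2.14') generalize
to

$$\frac{\partial p(s, \xi; t, A)}{\partial s} = q(s, \xi)\,p(s, \xi; t, A) - \int_{X-\{\xi\}} p(s, \eta; t, A)\, q(s, \xi, d\eta)$$

$$\frac{\partial p(s, \xi; t, A)}{\partial t} = -\int_A q(t, \eta, X - A)\, p(s, \xi; t, d\eta) + \int_{X-A} q(t, \eta, A)\, p(s, \xi; t, d\eta).$$

These equations are generalizations of (1.17) and (1.17') and are called
the backward and forward equations respectively, as are their specializations
(2.14) and (2.14').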


3. The diffusion equations and the corresponding Markov processes 


This section is devoted to (real) Markov continuous parameter processes
which are of the following type: $x(t_2) - x(t_1)$, the increment between times
$t_1$ and $t_2$, is a sum of small increments $dx(t)$, each of which is Gaussian
with mean $m\,dt$ and variance $\sigma^2\,dt$. These two quantities are of order $dt$,
and $m$ and $\sigma$ are functions of $t$ and $x(t)$. This is, of course, a rough
statement which is only intended to suggest the motivation for the discussion
to be given. We write

(3.1)    $$dx(t) = m[t, x(t)]\,dt + \sigma[t, x(t)]\,dy(t).$$


Here the y(t) process is the Brownian motion process with variance 
parameter 1 (see II §9 and VIII §2), that is, it is a real Gaussian process 
with independent increments and 


$$E\{y(t_2) - y(t_1)\} = 0, \qquad E\{[y(t_2) - y(t_1)]^2\} = |t_2 - t_1|.$$


The sample functions of a separable Brownian motion process are almost 
all continuous functions, but almost none are of bounded variation in 


274 MARKOV PROCESSES—CONTINUOUS PARAMETER VI 


any finite interval (see VIII §2). Equation (3.1) is only to be considered 
suggestive for the moment. It will be given a precise interpretation 
below. 
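
The rough description behind (3.1), a sum of small Gaussian increments with mean $m\,dt$ and variance $\sigma^2\,dt$, can be imitated directly on a time grid. The sketch below is the usual forward difference scheme (often called the Euler-Maruyama scheme); it is offered only as an illustration of the heuristic, not as the precise interpretation developed in the text, and the function and argument names are assumptions.

```python
import math
import random

def euler_maruyama(m, sigma, x0, t0, t1, n_steps, rng):
    """Grid values of an approximate sample function of (3.1).

    Each step adds m(t, x) * dt plus sigma(t, x) times an independent
    Gaussian increment with mean 0 and variance dt, standing in for dy(t).
    """
    dt = (t1 - t0) / n_steps
    t, x = t0, x0
    path = [x]
    for _ in range(n_steps):
        dy = rng.gauss(0.0, math.sqrt(dt))      # Brownian increment over dt
        x = x + m(t, x) * dt + sigma(t, x) * dy
        t += dt
        path.append(x)
    return path

# With sigma = 0 the scheme reduces to Euler's method for dx/dt = m(t, x).
path = euler_maruyama(lambda t, x: -x, lambda t, x: 0.0,
                      1.0, 0.0, 1.0, 10000, random.Random(1))
print(path[-1], math.exp(-1.0))
```

The choice of the left endpoint of each step for evaluating $m$ and $\sigma$ matters when $\sigma$ depends on $x$: it keeps the coefficient independent of the increment it multiplies.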

The material in VII §3, IX §2, and IX §5 will be accepted as known 
throughout this section. 

If $\sigma \equiv 0$ in (3.1) it is natural to interpret the equation as the
non-probabilistic differential equation

$$\frac{dx}{dt} = m(t, x).$$
In this case probabilistic concepts can enter only by way of the initial 
conditions. 

If $\sigma = \sigma(\cdot)$ depends on $t$ but not on $x$ in (3.1), and, if $m \equiv 0$, it is
natural to interpret the equation as a symbolic version of

$$x(t) - x(t_0) = \int_{t_0}^{t} \sigma(s)\, dy(s).$$

(See IX §2 for a discussion of this stochastic integral.) The $x(t)$ process
is in this case essentially the Brownian motion process with a change of
variable in the time parameter. The transition probability distribution
function is given by

$$p(s, \xi; t, \eta) = P\{x(t, \omega) \le \eta \mid x(s, \omega) = \xi\} = \frac{1}{\sqrt{2\pi A}} \int_{-\infty}^{\eta} e^{-(\lambda - \xi)^2/(2A)}\, d\lambda, \qquad s < t,$$

where

$$A = \int_s^t \sigma(r)^2\, dr.$$


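It follows that

(3.2)    $$\frac{\partial p(s, \xi; t, \eta)}{\partial s} = -\frac{\sigma(s)^2}{2}\, \frac{\partial^2 p(s, \xi; t, \eta)}{\partial \xi^2}$$

and

(3.2')   $$\frac{\partial p(s, \xi; t, \eta)}{\partial t} = \frac{\sigma(t)^2}{2}\, \frac{\partial^2 p(s, \xi; t, \eta)}{\partial \eta^2}.$$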


The first equation is called the backward equation because it involves 
differentiation with respect to the initial time; the second equation is 
called the forward equation because it involves differentiation with respect 
to the final time. 
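
The backward and forward equations for the Gaussian transition function of this special case can be checked by finite differences. The sketch below assumes the hypothetical coefficient $\sigma(r) = 1 + r$, for which the variance $\int_s^t \sigma(r)^2\,dr$ has a closed form; the backward residual is written with the sign convention in which $\partial p/\partial s$ and the second $\xi$-derivative term sum to zero, and all function names are illustrative.

```python
import math

def sigma(r):
    # Hypothetical time-dependent coefficient, chosen for the check.
    return 1.0 + r

def variance(s, t):
    """Integral of sigma(r)^2 over [s, t] for sigma(r) = 1 + r."""
    return ((1.0 + t) ** 3 - (1.0 + s) ** 3) / 3.0

def p(s, xi, t, eta):
    """Gaussian transition distribution function P{x(t) <= eta | x(s) = xi}."""
    z = (eta - xi) / math.sqrt(variance(s, t))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def residuals(s, xi, t, eta, d=1e-3):
    """Central-difference residuals of the backward and forward equations."""
    dp_ds = (p(s + d, xi, t, eta) - p(s - d, xi, t, eta)) / (2.0 * d)
    dp_dt = (p(s, xi, t + d, eta) - p(s, xi, t - d, eta)) / (2.0 * d)
    centre = p(s, xi, t, eta)
    d2_xi = (p(s, xi + d, t, eta) - 2.0 * centre + p(s, xi - d, t, eta)) / d ** 2
    d2_eta = (p(s, xi, t, eta + d) - 2.0 * centre + p(s, xi, t, eta - d)) / d ** 2
    backward = dp_ds + 0.5 * sigma(s) ** 2 * d2_xi   # should vanish
    forward = dp_dt - 0.5 * sigma(t) ** 2 * d2_eta   # should vanish
    return backward, forward

print(residuals(0.2, 0.1, 1.0, 0.7))
```

Both residuals should be small, limited only by the truncation error of the difference quotients.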

We shall not investigate in detail the generalization of the backward 
and forward differential equations (3.2) and (3.2') for a general $m$ and $\sigma$,
since we are interested in the processes rather than in the differential 




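equations. We therefore restrict ourselves to the following remarks.
Under any reasonable interpretation of (3.1) it can be concluded that

(3.3)    $$\lim_{h \downarrow 0} E\Big\{\frac{x(t+h) - x(t)}{h} \,\Big|\, x(t, \omega) = \xi\Big\} = m[t, \xi],$$

         $$\lim_{h \downarrow 0} E\Big\{\frac{[x(t+h) - x(t)]^2}{h} \,\Big|\, x(t, \omega) = \xi\Big\} = \sigma[t, \xi]^2.$$

Consider now the class of Markov processes for which there are functions
$m(\cdot,\cdot)$, $\sigma(\cdot,\cdot)$ satisfying (3.3). Under various regularity assumptions discussed
below it is shown that the transition probability function $p(\cdot,\cdot;\cdot,\cdot)$
satisfies the backward diffusion equation

(3.4)    $$\frac{\partial p(s, \xi; t, \eta)}{\partial s} + \frac{\sigma(s, \xi)^2}{2}\, \frac{\partial^2 p(s, \xi; t, \eta)}{\partial \xi^2} + m(s, \xi)\, \frac{\partial p(s, \xi; t, \eta)}{\partial \xi} = 0$$

and the forward diffusion equation

(3.4')   $$\frac{\partial p(s, \xi; t, \eta)}{\partial t} = \frac{1}{2}\, \frac{\partial}{\partial \eta}\Big[\sigma(t, \eta)^2\, \frac{\partial p(s, \xi; t, \eta)}{\partial \eta}\Big] - m(t, \eta)\, \frac{\partial p(s, \xi; t, \eta)}{\partial \eta}.$$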


The forward equation is called the Fokker-Planck equation and is usually 
the more natural equation to consider in physical problems. Kolmogorov
derived both equations in the first systematic treatment of this type of 
Markov process. 

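Note that the backward equation is a parabolic partial differential
equation in $s$ and $\xi$ for $s < t$; $t$ and $\eta$ enter only by way of the initial
condition

$$p(s, \xi; t, \eta)\big|_{s=t} = \begin{cases} 1, & \xi < \eta \\ 0, & \xi > \eta. \end{cases}$$

The forward equation is a parabolic partial differential equation in $t$ and
$\eta$ for $t > s$; $s$ and $\xi$ enter by way of the initial condition

$$p(s, \xi; t, \eta)\big|_{t=s} = \begin{cases} 1, & \eta > \xi \\ 0, & \eta < \xi. \end{cases}$$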


The hypotheses usually imposed to derive these differential equations are
the following (described only qualitatively).

$F_1$  It is supposed that $p(\cdot,\cdot;\cdot,\cdot)$ has appropriate regularity properties
(differentiability and so on).

$F_2$  It is supposed that the limits in (3.3) exist, and define functions
$m(\cdot,\cdot)$ and $\sigma(\cdot,\cdot)$ with appropriate regularity properties.




$F_3$  It is supposed that, for every $\varepsilon > 0$,

$$\int_{|\eta - \xi| > \varepsilon} d_\eta\, p(s, \xi; t, \eta) = P\{|x(t, \omega) - x(s, \omega)| > \varepsilon \mid x(s, \omega) = \xi\} = o(t - s) \qquad (t - s \to 0).$$

[It is also possible to rewrite (3.3) using truncated variables to avoid the
hypothesis that the first and second moments of $x(t) - x(s)$ exist.]
The condition $F_3$ is not satisfied for the processes discussed in §1, for
which the probability

$$P\{x(t, \omega) \ne x(s, \omega) \mid x(s)\}$$

is in general of the order of $t - s$. The sample functions of the separable
processes of §1 are typically step functions, progressing by jumps (although
this is not true in all cases). The sample functions of the processes under
discussion here are typically continuous, but there are exceptions to this,
and as in §1 it is to be expected that in a large class of these exceptional
cases the backward equation holds but the forward equation must be
replaced by an inequality. The theory is still incomplete on this point
as on many others relating to these processes.

In the present section we shall adopt an interpretation of (3.1) which
makes it possible to solve this equation and thus find a separable Markov
process satisfying (3.3), almost all of whose sample functions are continuous.
It will then be shown that conversely, if $m(\cdot,\cdot)$ and $\sigma(\cdot,\cdot)$ are
given functions, any Markov process satisfying (3.3), which has continuous
sample functions with probability 1, can be obtained as a solution of (3.1).

It has been shown by Feller that under suitable restrictions on $m(\cdot,\cdot)$
and $\sigma(\cdot,\cdot)$ the diffusion equations with the initial conditions stated above
can be solved to give the stochastic transition function of a Markov
process satisfying (3.3), and that this solution is unique. Moreover,
Fortet has shown that under Feller's conditions (at least if $\sigma \equiv 1$) almost
all sample functions of the corresponding separable processes are
continuous.

According to the remarks in the preceding paragraphs the processes
obtained by Feller must be exactly those found by solving (3.1) [where it
is supposed that Feller's conditions on $m(\cdot,\cdot)$, $\sigma(\cdot,\cdot)$ are strengthened if
necessary to agree with those which must be imposed in the discussion of
the preceding paragraph].

The results discussed in the preceding paragraph imply that if $m(\cdot,\cdot)$
and $\sigma(\cdot,\cdot)$ are sufficiently well-behaved the stochastic transition functions
obtained by solving (3.1) must be the same as those obtained by solving 
the Kolmogorov-Fokker-Planck diffusion equations, and therefore these 
stochastic transition functions must have various partial derivatives. No 
direct proof of this fact has yet been given. 




Finally we remark that the processes discussed in this section and the 
preceding one are special cases of a type which includes them both, whose 
stochastic transition functions satisfy integrodifferential equations 
obtained by combining those obtained in the preceding section with the 
Kolmogorov-Fokker-Planck equations. We omit any further discussion
of this general class. 

We now proceed to a discussion of (3.1). Let the range of $t$ be the
finite interval $[a, b]$. The natural interpretation of (3.1) is

(3.1')   $$x(t) - x(a) = \int_a^t m[s, x(s)]\, ds + \int_a^t \sigma[s, x(s)]\, dy(s).$$

This equation must be solved for an $x(t)$ process for which the two
integrals on the right are meaningful. Now for any sample function of
any $x(t)$ process the first integrand becomes an ordinary function of $s$,
and the usual criteria of integrability are applicable. The second
integral has been defined (see IX §5) if, for example, $\sigma[t_0, x(t_0)]$ is for
each $t_0$ a random variable independent of the aggregate of differences
$\{y(b) - y(s), s > t_0\}$. Fortunately the intuitive picture of the $x(t)$ process
which makes $x(t_0)$ the sum of $x(a)$ and suitably transposed and scaled
$y(s)$ increments $dy(s)$ with $a < s < t_0$ matches this restriction. Thus the
interpretation of (3.1) as (3.1') becomes a practical possibility. We
proceed to carry it out in detail, following Ito.

We make the following hypotheses:

$H_1$  $m(\cdot,\cdot)$ and $\sigma(\cdot,\cdot)$ are Baire functions of the pair $(t, \xi)$ for $a \le t \le b$,
$-\infty < \xi < \infty$.

$H_2$  There is a constant $K$ for which

$$|m(t, \xi)| \le K(1 + \xi^2)^{1/2},$$
$$0 \le \sigma(t, \xi) \le K(1 + \xi^2)^{1/2}.$$

$H_3$  $m(\cdot,\cdot)$ and $\sigma(\cdot,\cdot)$ satisfy a uniform Lipschitz condition in $\xi$,

$$|m(t, \xi_2) - m(t, \xi_1)| \le K|\xi_2 - \xi_1|,$$
$$|\sigma(t, \xi_2) - \sigma(t, \xi_1)| \le K|\xi_2 - \xi_1|,$$

where $K$ is independent of $t$ and $\xi$. It is no restriction to suppose this $K$
to be the same as the one in $H_2$.

Assuming hypotheses $H_1$, $H_2$, and $H_3$, we shall find an $x(t)$ process with
the following properties:

$P_1$  The $x(t)$ sample functions are almost all continuous in $[a, b]$.

$P_2$  $\int_a^b E\{x(t)^2\}\, dt < \infty$.




$P_3$  For each $t_0 \in (a, b)$, $x(t_0) - x(a)$ is independent of the aggregate of
differences $\{y(b) - y(s), s > t_0\}$.

$P_4$  For each $t \in [a, b]$, (3.1') is true with probability 1.

It will be shown that the $x(t)$ process is essentially uniquely determined
by these properties, and even satisfies

$P_2'$:  $E\{\max_{a \le t \le b} x(t, \omega)^2\} < \infty$.


LEMMA 3.1  If an $x(t)$ process has properties $P_1$, $P_2$, $P_3$, and if $m(\cdot,\cdot)$,
$\sigma(\cdot,\cdot)$ satisfy conditions $H_1$, $H_2$, $H_3$, then any $\tilde x(t)$ process defined by

(3.5)    $$\tilde x(t) = \int_a^t m[s, x(s)]\, ds + \int_a^t \sigma[s, x(s)]\, dy(s)$$

has properties $P_2$ and $P_3$. The second integral in (3.5) can be defined for
each $t$ in such a way that the $\tilde x(t)$ process also has property $P_1$, and the $\tilde x(t)$
process will then have property $P_2'$.

According to $H_1$, $H_2$, and $P_1$ the first integrand in (3.5) is for almost
all sample functions a bounded Baire function of $t$. Hence the first
integral defines a continuous function of $t$, with probability 1. The
second integral is a special case of the stochastic integral of IX §5, because
the qualitative conditions imposed there are satisfied, and because

$$\int_a^t E\{\sigma[s, x(s)]^2\}\, ds \le K^2 \int_a^b [1 + E\{x(s)^2\}]\, ds < \infty.$$

The second integral in (3.5) is thus well-defined, and is uniquely defined
for each $t$, neglecting values on sets of probability 0. The $\tilde x(t)$ process
obviously satisfies $P_3$ for any choice of these integrals. Property $P_2$ is
easily verified by a direct calculation, but it will follow from the fact to
be proved below that even $P_2'$ is true if these integrals are chosen properly,
so the calculation will be omitted. Now, it is shown in IX §5 that the
second integral in (3.5) defines a martingale as $t$ varies, and moreover that
it can be defined for each $t$ to get continuous sample functions, with
probability 1. With this definition the $\tilde x(t)$ process has property $P_1$. It
will then have property $P_2'$ because

$$E\Big\{\max_{a \le t \le b} \Big|\int_a^t m[s, x(s)]\, ds\Big|^2\Big\} \le E\Big\{\Big[\int_a^b |m[s, x(s)]|\, ds\Big]^2\Big\} \le (b - a)K^2 \int_a^b E\{1 + x(s)^2\}\, ds < \infty,$$




and by the continuous parameter version of VII, Theorem 2.4, applied to
the absolute value of the last term in (3.5), which determines a semi-martingale,

$$E\Big\{\max_{a \le t \le b} \Big|\int_a^t \sigma[s, x(s)]\, dy(s)\Big|^2\Big\} \le 4E\Big\{\Big|\int_a^b \sigma[s, x(s)]\, dy(s)\Big|^2\Big\} = 4\int_a^b E\{\sigma[s, x(s)]^2\}\, ds \le 4K^2 \int_a^b [1 + E\{x(s)^2\}]\, ds < \infty.$$


Equation (3.1') is solved by successive approximation. Let $x(a)$ be any
random variable with $E\{x(a)^2\} < \infty$, independent of the aggregate of
differences $\{y(t_2) - y(t_1),\ t_1, t_2 \in [a, b]\}$. Let the $x_0(t)$ process be any
process with properties $P_1$, $P_2$, $P_3$. [For example, take $x_0(t) \equiv 0$.] According
to the lemma it is then possible to define $x_n(t)$ for $n \ge 1$ inductively
by

(3.6)    $$x_n(t) = x(a) + \int_a^t m[s, x_{n-1}(s)]\, ds + \int_a^t \sigma[s, x_{n-1}(s)]\, dy(s)$$

in such a way that every $x_n(t)$ process has properties $P_1$, $P_2'$, $P_3$. We shall
prove that with this definition

(3.7)    $$\lim_{n \to \infty} x_n(t) = x(t), \qquad a \le t \le b$$

uniformly in $t$ with probability 1, defining an $x(t)$ process with properties
$P_1$, $P_2'$, $P_3$ and that

(3.8)    $$\lim_{n \to \infty} \int_a^t m[s, x_n(s)]\, ds = \int_a^t m[s, x(s)]\, ds, \qquad a \le t \le b$$

         $$\lim_{n \to \infty} \int_a^t \sigma[s, x_n(s)]\, dy(s) = \int_a^t \sigma[s, x(s)]\, dy(s)$$

uniformly in $t$, with probability 1. The $x(t)$ process will then be a solution
of (3.1'). To prove these facts we make the definitions

$$\Delta_n x(t) = x_n(t) - x_{n-1}(t)$$
$$\Delta_n m(t) = m[t, x_n(t)] - m[t, x_{n-1}(t)]$$
$$\Delta_n \sigma(t) = \sigma[t, x_n(t)] - \sigma[t, x_{n-1}(t)],$$

so that, by $H_3$,

$$|\Delta_n m(t)| \le K|\Delta_n x(t)|, \qquad |\Delta_n \sigma(t)| \le K|\Delta_n x(t)|.$$




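Then

$$E\{[\Delta_n x(t)]^2\} \le 2E\Big\{\Big|\int_a^t \Delta_{n-1} m(s)\, ds\Big|^2\Big\} + 2E\Big\{\Big|\int_a^t \Delta_{n-1} \sigma(s)\, dy(s)\Big|^2\Big\} \le 2K^2(b - a + 1) \int_a^t E\{[\Delta_{n-1} x(s)]^2\}\, ds, \qquad n > 1.$$

Hence

(3.9)    $$E\{[\Delta_n x(t)]^2\} \le [2K^2(b - a + 1)]^{n-1} \int_a^t \frac{(t - s)^{n-2}}{(n-2)!}\, E\{[\Delta_1 x(s)]^2\}\, ds \le \frac{c^n}{n!}$$

for some constant $c$. Using this inequality,

$$P\Big\{\max_{a \le t \le b} \Big|\int_a^t \Delta_n m(s)\, ds\Big| \ge 2^{-n}\Big\} \le P\Big\{\int_a^b K|\Delta_n x(s)|\, ds \ge 2^{-n}\Big\} \le 4^n E\Big\{\Big[\int_a^b K|\Delta_n x(s)|\, ds\Big]^2\Big\} \le \frac{4^n (b - a)^2 K^2 c^n}{n!}.$$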


Since the last term is the general term of a convergent series,

(3.10)   $$\max_{a \le t \le b} \Big|\int_a^t \Delta_n m(s)\, ds\Big| < 2^{-n}$$

for sufficiently large $n$, with probability 1 (Borel-Cantelli lemma, III,
Theorem 1.2). According to IX §5, the process

$$\Big\{\int_a^t \Delta_n \sigma(s)\, dy(s),\ a \le t \le b\Big\}$$

is a martingale. Then the family of squares of these random variables is
a semi-martingale, to which the continuous parameter version of VII,
Theorem 3.2, is applicable. In view of (3.9), we obtain

$$P\Big\{\max_{a \le t \le b} \Big|\int_a^t \Delta_n \sigma(s)\, dy(s)\Big| \ge 2^{-n}\Big\} \le 4^n E\Big\{\Big|\int_a^b \Delta_n \sigma(s)\, dy(s)\Big|^2\Big\} = 4^n \int_a^b E\{[\Delta_n \sigma(s)]^2\}\, ds \le 4^n K^2 \int_a^b E\{[\Delta_n x(s)]^2\}\, ds \le \frac{4^n K^2 (b - a) c^n}{n!}.$$
J I 




Since the last term is the general term of a convergent series,

(3.11)   $$\max_{a \le t \le b} \Big|\int_a^t \Delta_n \sigma(s)\, dy(s)\Big| < 2^{-n}$$

for sufficiently large $n$, with probability 1. According to (3.10) and
(3.11) the integrals on the right in (3.6) converge uniformly in $t$, when
$n \to \infty$, with probability 1. Hence the limit in (3.7) exists uniformly in
$t$, with probability 1. The $x(t)$ process defined in this way obviously has
properties $P_1$ and $P_3$. For each $t$, if $n > m$,


(3.12)   $$E\{[x_n(t) - x_m(t)]^2\} = E\Big\{\Big[\sum_{j=m+1}^{n} \Delta_j x(t)\Big]^2\Big\} \le \sum_{j=m+1}^{n} 2^{-j} \sum_{j=m+1}^{n} 2^{j} E\{[\Delta_j x(t)]^2\} \le 2^{-m} \sum_{j=m+1}^{\infty} \frac{(2c)^j}{j!} \to 0 \qquad (m \to \infty).$$

Then, for each $t$, l.i.m. $x_n(t)$ exists and the mean limit must be $x(t)$ since
mean limits and probability 1 limits must coincide (with probability 1).
Thus when $n \to \infty$ in (3.12) we find that

$$E\{[x(t) - x_m(t)]^2\} \le 2^{-m} \sum_{j=m+1}^{\infty} \frac{(2c)^j}{j!},$$

so that the $x(t)$ process has property $P_2$. Finally (3.8) is true uniformly
in $t$, with probability 1 [so that (3.1') is true] because there is uniform
convergence of the integrands with probability 1 in the first limit equation
and because, applying the continuous parameter version of VII, Theorem
3.2, as above,

$$P\Big\{\max_{a \le t \le b} \Big|\int_a^t (\sigma[s, x(s)] - \sigma[s, x_n(s)])\, dy(s)\Big| \ge \frac{1}{n}\Big\} \le n^2 \int_a^b E\{(\sigma[s, x(s)] - \sigma[s, x_n(s)])^2\}\, ds$$
$$\le K^2 n^2 \int_a^b E\{[x(s) - x_n(s)]^2\}\, ds \le K^2 n^2 (b - a)\, 2^{-n} \sum_{j=n+1}^{\infty} \frac{(2c)^j}{j!}.$$

Hence, according to the Borel-Cantelli lemma, the maximum involved
here will be $< 1/n$ for sufficiently large $n$, with probability 1, so that the
second limit equation in (3.8) is true uniformly in $t$, with probability 1.
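
The successive approximations (3.6) can be imitated on a time grid for a single simulated Brownian path. The sketch below is only a discrete analogue, under the assumption that the two integrals are replaced by left-endpoint sums; the function name and the test coefficients are illustrative. On a grid the iteration even becomes stationary after finitely many steps, since the value at the $k$th grid point depends only on the previous iterate at earlier grid points.

```python
import math
import random

def picard_iterates(m, sigma, x_a, a, b, n_steps, n_iter, rng):
    """Discrete analogue of (3.6) along one simulated Brownian path.

    Returns (iterates, sup_diffs), where sup_diffs[i] is the maximum over
    the grid of |x_{i+1} - x_i|, the analogue of max |Delta_{i+1} x(t)|.
    """
    dt = (b - a) / n_steps
    times = [a + k * dt for k in range(n_steps + 1)]
    dy = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]
    prev = [0.0] * (n_steps + 1)            # x_0(t) = 0, as in the text's example
    iterates, sup_diffs = [prev], []
    for _ in range(n_iter):
        cur = [x_a]
        for k in range(n_steps):
            # left-endpoint sums standing in for the ds and dy(s) integrals
            step = (m(times[k], prev[k]) * dt
                    + sigma(times[k], prev[k]) * dy[k])
            cur.append(cur[-1] + step)
        sup_diffs.append(max(abs(u - v) for u, v in zip(cur, prev)))
        iterates.append(cur)
        prev = cur
    return iterates, sup_diffs

# Hypothetical coefficients satisfying a growth and a Lipschitz condition.
its, diffs = picard_iterates(lambda t, x: -x,
                             lambda t, x: 0.5 * math.sqrt(1.0 + x * x),
                             1.0, 0.0, 1.0, 20, 25, random.Random(3))
print(diffs[:6], diffs[-1])
```

The list `diffs` typically decays rapidly, which is the grid counterpart of the $c^n/n!$ bound in (3.9).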




The proof of the existence of a solution to (3.1') is now complete. The
solution (satisfying $P_1$, $P_2$, $P_3$) is essentially uniquely determined by $x(a)$.
In fact, if $\Delta(t)$ is the difference of two solutions with the same $x(a)$, the
argument leading to (3.9) shows that

$$E\{\Delta(t)^2\} \le \frac{c^n}{n!}, \qquad a \le t \le b,$$

so that $\Delta(t) = 0$, with probability 1, for each $t$, and in view of $P_1$ we then
have

$$P\{\Delta(t, \omega) = 0,\ a \le t \le b\} = 1.$$


Any solution of (3.1') satisfies

(3.13)   $$x(t) - x(\tau) = \int_\tau^t m[s, x(s)]\, ds + \int_\tau^t \sigma[s, x(s)]\, dy(s).$$

We shall always suppose, as we have above, that $x(a)$ is a random variable
independent of the aggregate of differences $\{y(t_2) - y(t_1),\ t_1, t_2 \in [a, b]\}$.
This property reproduces itself in the sense that $x(\tau)$ is then a random
variable independent of the aggregate of differences $\{y(t_2) - y(t_1),\ t_1,
t_2 \in [\tau, b]\}$. According to (3.13) $x(t)$ depends only on $x(\tau)$ and the
y-differences for arguments between $\tau$ and $t$. The latter differences are
independent of $x(\tau)$, $x(a)$, and (if $s < \tau$) the y-differences for arguments
between $a$ and $\tau$, upon which $x(s)$ depends. It follows that the conditional
distribution of $x(t)$ for $\{x(s), s \le \tau\}$ given is a function of $x(\tau)$ alone. In
other words, the $x(t)$ process is a Markov process, and the conditional
distribution of $x(t)$ for $x(\tau, \omega) = \xi$ is the distribution of the solution of
(3.1') with $a = \tau$ and $P\{x(a, \omega) = \xi\} = 1$. It is not a priori obvious that
this uniquely defined conditional distribution satisfies the Chapman-Kolmogorov
equation identically, that is, without the necessity of the
exclusion of sets of probability 0. The following argument shows that
the Chapman-Kolmogorov equation is satisfied identically in the present
case: We shall distinguish the probability distribution obtained in (3.1')
when $a = \tau$ and $P\{x(a, \omega) = \xi\} = 1$ by the subscripts $\tau$, $\xi$. Suppose that
$\tau < s < t$. Then

$$P_{\tau,\xi}\{x(t, \omega) \le \lambda\} = E_{\tau,\xi}\{P_{\tau,\xi}\{x(t, \omega) \le \lambda \mid x(s, \omega) = \eta\}\} = E_{\tau,\xi}\{P_{s,\eta}\{x(t, \omega) \le \lambda\}\},$$

where we have used the fact that the process is a Markov process with the 
stated transition probabilities for any initial conditions (independent of 
later dy’s). This equation is precisely the Chapman-Kolmogorov 
equation. 
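
In the special case $m \equiv 0$, $\sigma \equiv 1$ the transition distribution is the Gaussian one of the Brownian motion process, and the Chapman-Kolmogorov equation can be checked by direct numerical integration. The sketch below assumes that special case; the function names, the integration window, and the trapezoid rule are illustrative choices.

```python
import math

def gauss_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gauss_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def transition_cdf(s, xi, t, lam):
    """P{x(t) <= lam | x(s) = xi} when m = 0 and sigma = 1."""
    return gauss_cdf((lam - xi) / math.sqrt(t - s))

def chapman_kolmogorov_rhs(tau, xi, s, t, lam, n=4000, width=10.0):
    """Average P_{s,eta}{x(t) <= lam} over the law of x(s) started at (tau, xi).

    Trapezoid rule over eta in a window of +/- width standard deviations.
    """
    sd = math.sqrt(s - tau)
    lo, hi = xi - width * sd, xi + width * sd
    step = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        eta = lo + k * step
        weight = 0.5 if k in (0, n) else 1.0
        density = gauss_pdf((eta - xi) / sd) / sd
        total += weight * density * transition_cdf(s, eta, t, lam)
    return total * step

lhs = transition_cdf(0.0, 0.3, 1.2, 0.9)
rhs = chapman_kolmogorov_rhs(0.0, 0.3, 0.5, 1.2, 0.9)
print(lhs, rhs)
```

The two printed numbers should agree to within the quadrature error.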




It will be useful below to have an evaluation of $E\{\max_{a \le s \le t} |x(s) - x(a)|^2\}$.
Although this can be obtained from the preceding work, it is more instructive
to obtain it directly. In the following, $K_1$, $K_2$, $K_3$ will denote
constants whose choice will depend only on $K$ and $(b - a)$. We have,
using the inequalities obtained in the proof of Lemma 3.1,

$$E\Big\{\max_{a \le s \le t} |x(s) - x(a)|^2\Big\} \le 2E\Big\{\max_{a \le s \le t} \Big|\int_a^s m[r, x(r)]\, dr\Big|^2\Big\} + 2E\Big\{\max_{a \le s \le t} \Big|\int_a^s \sigma[r, x(r)]\, dy(r)\Big|^2\Big\}$$
$$\le [2K^2(b - a) + 8K^2] \int_a^t (1 + E\{x(s)^2\})\, ds$$
$$\le K_1(t - a)[1 + E\{x(a)^2\}] + K_1 \int_a^t E\{|x(s) - x(a)|^2\}\, ds.$$

If now the left side of this inequality is replaced by $E\{|x(t) - x(a)|^2\}$, the
inequality can be integrated to obtain

$$\int_a^t E\{|x(s) - x(a)|^2\}\, ds \le K_2(t - a)^2[1 + E\{x(a)^2\}]$$

and the original inequality then becomes

(3.14)   $$E\Big\{\max_{a \le s \le t} |x(s) - x(a)|^2\Big\} \le K_3(t - a)[1 + E\{x(a)^2\}].$$

We remark that the expectation may be replaced by a conditional expectation
for $x(a, \omega) = \xi$ in this inequality, and of course $a$ may be replaced
by any other point of the interval $[a, b]$. This replacement will be made
as required without further comment.

The structure of the $x(t)$ process in the small is now easily found. We
write

(3.15)   $$x(t + h) - x(t) = \int_t^{t+h} [m[s, x(s)] - m(s, \xi)]\, ds + \int_t^{t+h} [\sigma[s, x(s)] - \sigma(s, \xi)]\, dy(s) + \int_t^{t+h} m(s, \xi)\, ds + \int_t^{t+h} \sigma(s, \xi)\, dy(s),$$




and consider the distribution of $x(t + h) - x(t)$ for $x(t, \omega) = \xi$. The sum
of the last two terms is Gaussian, with mean and variance

$$\int_t^{t+h} m(s, \xi)\, ds, \qquad \int_t^{t+h} \sigma(s, \xi)^2\, ds$$

respectively. Moreover, this Gaussian random variable is independent
of the past, that is, of the aggregate of random variables $\{x(\tau), \tau \le t\}$. If
the functions $m$, $\sigma$ are continuous, the above mean and variance are
$h\,m(t, \xi)$, $h\,\sigma(t, \xi)^2$ aside from $o(h)$ error terms. Even without the continuity
hypothesis the statement is true for each $\xi$ for all values of $t$ except those
of a set of Lebesgue measure 0. On the other hand, the first two terms
on the right in (3.15) are of order $o(h)$ in the sense that using (3.14), as
adjusted to the present situation,

(3.16)   $$E_{t,\xi}\Big\{\max_{t \le \tau \le t+h} \Big|\int_t^\tau [m[s, x(s)] - m(s, \xi)]\, ds\Big|^2\Big\} \le E_{t,\xi}\Big\{\Big[\int_t^{t+h} |m[s, x(s)] - m(s, \xi)|\, ds\Big]^2\Big\}$$
$$\le K^2 h \int_t^{t+h} E_{t,\xi}\{|x(s) - \xi|^2\}\, ds \le K^2 K_3 h^3 (1 + \xi^2),$$

and using the continuous parameter version of VII, Theorem 3.4,

(3.17)   $$E_{t,\xi}\Big\{\max_{t \le \tau \le t+h} \Big|\int_t^\tau [\sigma[s, x(s)] - \sigma(s, \xi)]\, dy(s)\Big|^2\Big\} \le 4E_{t,\xi}\Big\{\Big|\int_t^{t+h} [\sigma[s, x(s)] - \sigma(s, \xi)]\, dy(s)\Big|^2\Big\}$$
$$= 4\int_t^{t+h} E_{t,\xi}\{|\sigma[s, x(s)] - \sigma(s, \xi)|^2\}\, ds \le 4K^2 \int_t^{t+h} E_{t,\xi}\{|x(s) - \xi|^2\}\, ds \le 4K^2 K_3 h^2 (1 + \xi^2).$$




The above results imply that, for each $\xi$,

(3.18)   $$E\{x(t + h) - x(t) \mid x(t, \omega) = \xi\} = \int_t^{t+h} m(s, \xi)\, ds + (1 + \xi^2)^{1/2}\, O(h^{3/2}),$$

         $$E\{[x(t + h) - x(t)]^2 \mid x(t, \omega) = \xi\} = \int_t^{t+h} \sigma(s, \xi)^2\, ds + (1 + \xi^2)\, O(h^{3/2}),$$

where the $O(\ )$ terms are uniform in $t$ and $\xi$. These equations make
(3.3) precise for the solutions of the stochastic differential equation (3.1).
We observe again that for each $\xi$ the first terms on the right are
$h[m(t, \xi) + o(1)]$ and $h[\sigma(t, \xi)^2 + o(1)]$ if $m$ and $\sigma$ are continuous in $t$, and
without the continuity hypothesis the statement is still true (for each $\xi$)
for all values of $t$ except possibly those in a set of Lebesgue measure 0.
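
The local behavior described by (3.18) can be seen explicitly in a case where the transition distribution is known in closed form. The sketch below assumes the hypothetical coefficients $m(t, \xi) = -\theta\xi$ and $\sigma(t, \xi) = \sigma_0$ (the Ornstein-Uhlenbeck case), for which $x(t+h)$ given $x(t) = \xi$ is Gaussian with mean $\xi e^{-\theta h}$ and variance $\sigma_0^2(1 - e^{-2\theta h})/(2\theta)$; the function name is illustrative.

```python
import math

def ou_increment_moments(theta, sig, xi, h):
    """Exact conditional moments of x(t+h) - x(t) given x(t) = xi.

    Assumed test case dx = -theta * x dt + sig dy.  Returns the conditional
    mean and the conditional second moment of the increment.
    """
    mean = xi * math.expm1(-theta * h)                      # xi * (e^{-theta h} - 1)
    var = sig * sig * (-math.expm1(-2.0 * theta * h)) / (2.0 * theta)
    return mean, var + mean * mean

theta, sig, xi = 1.5, 0.8, 2.0
for h in (1e-2, 1e-3, 1e-4):
    mean, second = ou_increment_moments(theta, sig, xi, h)
    # Ratios to compare with m(t, xi) = -theta * xi and sigma(t, xi)^2 = sig^2.
    print(h, mean / h, second / h)
```

As $h$ decreases, the two printed ratios approach $-\theta\xi$ and $\sigma_0^2$, as (3.3) and (3.18) require.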

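Finally we remark that, if $\varepsilon > 0$ and if $h$ is so small that

$$\Big|\int_t^{t+h} m(s, \xi)\, ds\Big| < \frac{\varepsilon}{4},$$

(3.15) yields, in view of (3.16) and (3.17),

(3.19)   $$P\{|x(t + h, \omega) - x(t, \omega)| \ge \varepsilon \mid x(t, \omega) = \xi\} \le \frac{16 h^3 K^2 K_3 (1 + \xi^2)}{\varepsilon^2} + \frac{64 K^2 K_3 h^2 (1 + \xi^2)}{\varepsilon^2} + P\Big\{\Big|\int_t^{t+h} \sigma(s, \xi)\, dy(s)\Big| \ge \frac{\varepsilon}{4}\Big\}.$$

The last term is

$$\Big(\frac{2}{\pi}\Big)^{1/2} \int_c^\infty e^{-\lambda^2/2}\, d\lambda, \qquad c = \frac{\varepsilon}{4}\Big[\int_t^{t+h} \sigma(s, \xi)^2\, ds\Big]^{-1/2}.$$

According to VIII (2.2) this integral is at most $(1 + \xi^2)^{3/2}\, O(h^{3/2})$.
Hence we have proved

(3.20)   $$P\{|x(t + h, \omega) - x(t, \omega)| \ge \varepsilon \mid x(t, \omega) = \xi\} \le (1 + \xi^2)^{3/2}\, O(h^{3/2}),$$

where $O(h^{3/2})$ is uniform in $\xi$ and $t$. We even have the stronger inequality

(3.20')  $$P\Big\{\max_{t \le \tau \le t+h} |x(\tau) - x(t)| \ge \varepsilon \,\Big|\, x(t, \omega) = \xi\Big\} \le (1 + \xi^2)^{3/2}\, O(h^{3/2}).$$

In fact, the evaluations we have given prove this inequality also, if (using
VIII, Theorem 2.1) the majorant of the last term in (3.19) is doubled.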




We have thus verified that the solutions of the stochastic differential
equation (3.1) satisfy conditions of the same character as $F_2$ and $F_3$
discussed at the beginning of this section. We have already remarked at
the beginning of this section that conditions $F_1$, $F_2$, and $F_3$ imply, at least
in a wide class of cases, that the sample functions of corresponding 
separable processes are almost all continuous. Thus it appears that we 
are discussing here the same general class of processes as that discussed 
in the theory of diffusion. This fact explains the title of the present 
section. 

We observe that if (3.1') were replaced by

(3.1'')  $$x(t) = \tilde x(t) + \int_a^t m[s, x(s)]\, ds + \int_a^t \sigma[s, x(s)]\, dy(s),$$

where the $\tilde x(t)$ process is a given process with continuous sample functions,
with $\int_a^b E\{\tilde x(t)^2\}\, dt < \infty$, the existence and uniqueness proofs would
require no change. The only difference would be that $P_2'$ would no longer
be satisfied unless $E\{\max_{a \le t \le b} \tilde x(t)^2\} < \infty$. The $x(t)$ process is a Markov
process if for each $t$ the aggregates $\{x(s), a \le s \le t\}$, $\{y(b) - y(s), t \le s \le b\}$
are mutually independent. In particular, if $\tilde x(t)$ does not really depend on
$t$, say $\tilde x(t) \equiv \tilde x$, we have $\tilde x = x(a)$ and the special case (3.1'). If $\sigma \equiv 0$ in
(3.1'') and if, for each $t$, $\tilde x(t)$ is identically a constant, the probability
element of this study disappears. Our work is still applicable, however:
it proves the well-known fact that for every continuous t-function $\tilde x(\cdot)$
there is one and only one continuous t-function $x(\cdot)$ satisfying

$$x(t) = \tilde x(t) + \int_a^t m[s, x(s)]\, ds.$$


The work of this section has been devoted to the solution of the 
stochastic differential equation (3.1). We now consider the converse 
problem: What $x(t)$ processes can be written as solutions of this equation?
This problem will be treated in several stages. 

THEOREM 3.2 Let $\{x(t), a \leq t \leq b\}$ be a separable stochastic process
with the following properties:


(a) the process is measurable; 


b 
b) E(D} < 0a < t< b; | EO} dt < w; 




(c) there is a Baire function m(-,*) with 
|m(t, 8| < KA + Ee, 


for some constant K, such that, if a < t; < t, < b, 


ty 
Efel) — a(r) 120), 1< 4} = Bf | mils, 26) ds |a(0), t< f} 
t 


with probability 1. 
Then the $\tilde{x}(t)$ process defined by

$\tilde{x}(t) = x(t) - \int_a^t m[s, x(s)]\,ds, \quad a \leq t \leq b,$

is a separable martingale, and in fact, if $t_1 < t_2$,

(3.21)  $E\{\tilde{x}(t_2) \mid x(t), t \leq t_1\} = \tilde{x}(t_1)$

with probability 1.
According to our hypotheses 
t ts 
Effe) a)l} < E f mls, 2(s)] ds} < KEE f [1 + eO] 45} 
ù ù 
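so that the two sides of the equation in (c) are well defined, and $E\{|\tilde{x}(t)|\} < \infty$ for $a \leq t \leq b$. Moreover,

$E\{\tilde{x}(t_2) - \tilde{x}(t_1) \mid x(t), t \leq t_1\} = E\{x(t_2) - x(t_1) \mid x(t), t \leq t_1\} - E\left\{\int_{t_1}^{t_2} m[s, x(s)]\,ds \,\Big|\, x(t), t \leq t_1\right\} = 0$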


with probability 1, so that (3.21) is true. This equation implies the martingale property for the $\tilde{x}(t)$ process. Since the $x(t)$ process is separable, and since the integral in the definition of $\tilde{x}(t)$ is a continuous function of $t$ for almost all sample functions of the $x(t)$ process, the $\tilde{x}(t)$ process is also separable. We observe that this theorem implies that the sample functions of the $x(t)$ process are (almost all) continuous except for jumps, since this fact is true for the $\tilde{x}(t)$ process sample functions.

THEOREM 3.3 Let $\{x(t), a \leq t \leq b\}$ be a real stochastic process. Let $\mathcal{F}_t$ be the smallest Borel field of $\omega$ sets with respect to which the $x(s)$'s with $s \leq t$ are measurable. Let $m(\cdot, \cdot)$, $\sigma(\cdot, \cdot)$ be functions of $t$, $\xi$. The following hypotheses are made.


(a) Almost all sample functions of the process are continuous in [a, b]. 


(b) $E\{x(t)^2\} < \infty$ for $t \in [a, b]$, and, if $s < t$,

$E\{x(t)^2 \mid \mathcal{F}_s\} \leq z_s$

with probability 1, where, for each $s$, $z_s$ is a random variable measurable with respect to the field $\mathcal{F}_s$, does not depend on $t$, and $E\{z_s\} < \infty$.

(c) There is a monotone non-decreasing function $f(\cdot)$, with $\lim_{h \to 0} f(h) = 0$, such that for each $t$ and $h$ with $a \leq t < t + h \leq b$

$\left| E\{x(t+h) - x(t) \mid \mathcal{F}_t\} - \int_t^{t+h} m[s, x(t)]\,ds \right| \leq [1 + x(t)^2]\, h f(h)$

$\left| E\{[x(t+h) - x(t)]^2 \mid \mathcal{F}_t\} - \int_t^{t+h} \sigma[s, x(t)]^2\,ds \right| \leq [1 + x(t)^2]\, h f(h)$

with probability 1.

(d) $m(\cdot, \cdot)$, $\sigma(\cdot, \cdot)$ are Baire functions, continuous in their second variables, and there is a constant $K$ for which

$|m(t, \xi)| \leq K(1 + \xi^2)^{1/2},$
$0 \leq \sigma(t, \xi) \leq K(1 + \xi^2)^{1/2}.$


With these hypotheses it follows that the $x(t) - x(a)$ process is a Markov process. If $\sigma(t, \xi)$ vanishes for no $t$, $\xi$, there is a Brownian motion process $\{y(t), a \leq t \leq b\}$ such that $x(t)$ is a solution of the stochastic differential equation (3.1). If $\sigma(t, \xi)$ may vanish, the statement remains true if a Brownian motion process is adjoined to $\omega$ space.

See II §2 for the significance of the adjunction of a process to the given $\omega$ space.

Note that, if in addition to the conditions imposed in this theorem $m(\cdot, \cdot)$ and $\sigma(\cdot, \cdot)$ satisfy uniform Lipschitz conditions in $\xi$, then it follows from our discussion of the stochastic differential equation (3.1) that this equation can be solved to get an $x(t)$ process uniquely determined by $x(a)$ and a given Brownian motion $y(t)$ process, for which the hypotheses of the present theorem are true with


2, = 2K (b— al + 2) + 2x2 


[see (3.14)] and $f(h) = \text{const.}\, h^{1/2}$. Thus the conditions under which a given process can be written as a solution of (3.1) are less stringent than the conditions under which we have proved that (3.1) can be solved.

We show first that the hypotheses of Theorem 3.2 are satisfied. Only the third one, the one involving $m(t, \xi)$, is not obvious. To prove this one let $t_1 = s_0 < s_1 < \cdots < s_n = t_2$, $\delta = \max_j (s_{j+1} - s_j)$. Then


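$\left| E\{x(t_2) - x(t_1) \mid \mathcal{F}_{t_1}\} - E\left\{\int_{t_1}^{t_2} m[s, x(s)]\,ds \,\Big|\, \mathcal{F}_{t_1}\right\} \right|$

$= \left| E\left\{\sum_{j=0}^{n-1} \left( [x(s_{j+1}) - x(s_j)] - \int_{s_j}^{s_{j+1}} m[s, x(s)]\,ds \right) \Big|\, \mathcal{F}_{t_1}\right\} \right|$

$\leq \left| E\left\{\sum_{j=0}^{n-1} \int_{s_j}^{s_{j+1}} \left( m[s, x(s)] - m[s, x(s_j)] \right) ds \,\Big|\, \mathcal{F}_{t_1}\right\} \right| + \sum_{j=0}^{n-1} (s_{j+1} - s_j)\, f(s_{j+1} - s_j)\, [1 + E\{x(s_j)^2 \mid \mathcal{F}_{t_1}\}]$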

with probability 1. It is sufficient to prove that the last two terms in this inequality can be made arbitrarily small by choosing the $s_j$'s properly. The last term is dominated by

$(b - a)\, f(\delta)\, (1 + z_{t_1})$

with probability 1, and this goes to 0 with $\delta$. The preceding term has the form $E\{\varphi \mid \mathcal{F}_{t_1}\}$. It is, therefore, sufficient to prove:

(i) $\varphi \to 0$ when $\delta \to 0$ if the $s_j$'s are chosen properly;

(ii) $E\{\varphi^2 \mid \mathcal{F}_{t_1}\}$ is bounded independently of $\delta$ by some finite $\omega$ function.

The first statement is true (with probability 1) because $m(t, \cdot)$ is continuous and almost all sample functions of the $x(t)$ process are continuous. As to (ii), we need only remark that

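$|\varphi| \leq K \sum_{j=0}^{n-1} \int_{s_j}^{s_{j+1}} \left[ (1 + x(s)^2)^{1/2} + (1 + x(s_j)^2)^{1/2} \right] ds$

so that

$\varphi^2 \leq 2K^2 (t_2 - t_1) \left[ \int_{t_1}^{t_2} [1 + x(s)^2]\,ds + \sum_{j=0}^{n-1} [1 + x(s_j)^2]\,(s_{j+1} - s_j) \right]$

and hence

$E\{\varphi^2 \mid \mathcal{F}_{t_1}\} \leq 4K^2 (t_2 - t_1)^2 (1 + z_{t_1})$

with probability 1, so that we have obtained the desired bound and thereby shown that Theorem 3.2 is applicable. Applying this theorem, we define an $\tilde{x}(t)$ process by

$\tilde{x}(t) = x(t) - \int_a^t m[s, x(s)]\,ds,$

so that the $\tilde{x}(t)$ process is a martingale whose sample functions are almost all continuous.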




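Define $c$ by

$c = E\left\{ \left[ \int_a^b |m[s, x(s)]|\,ds \right]^2 \right\} \leq (b - a) K^2 E\left\{ \int_a^b [1 + x(s)^2]\,ds \right\} \leq (b - a)^2 K^2 [1 + E\{z_a\}] < \infty.$

Then

$E\{\tilde{x}(t)^2\} \leq 2E\{x(t)^2\} + 2c < \infty$

so that, using the continuous parameter version of VII, Theorem 3.4, we obtain

$E\{\max_{a \leq t \leq b} x(t)^2\} \leq 2E\{\max_{a \leq t \leq b} \tilde{x}(t)^2\} + 2c \leq 8E\{\tilde{x}(b)^2\} + 2c < \infty.$

If $t_1 = s_0 < s_1 < \cdots < s_n = t_2$ we have, from (3.21),

$E\{\tilde{x}(s_k) \mid \mathcal{F}_{s_j}\} = \tilde{x}(s_j), \quad j < k,$

with probability 1. Hence

$E\{\tilde{x}(s_j)\tilde{x}(s_k) \mid \mathcal{F}_{s_j}\} = \tilde{x}(s_j)^2, \quad j < k,$

with probability 1. Using this fact,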


| ELEC) — ADE |F} — Ef l ols, x(s)}* ds |F} | 


(> 
(3 


z lsa) TA > i afs, æ(s)}? a| 7 a)l 


e 
<|e J tat 24) — ols, (5,9) ds |7) | 


+ 3 (S544 — 55) f(Sjr—Sy)[L + Eels)? | Fh] 
with probability 1. We prove that

(3.22)  $E\{[\tilde{x}(t_2) - \tilde{x}(t_1)]^2 \mid \mathcal{F}_{t_1}\} = E\left\{ \int_{t_1}^{t_2} \sigma[s, x(s)]^2\,ds \,\Big|\, \mathcal{F}_{t_1} \right\}$

with probability 1 by proving that the last two terms in the preceding inequality can be made arbitrarily small by choosing $\delta = \max_j (s_{j+1} - s_j)$ small. We have already treated the last term. The next to the last has the form $E\{\psi \mid \mathcal{F}_{t_1}\}$, where $\psi \to 0$ when $\delta \to 0$, for almost all $\omega$. It will,
therefore, according to I §8, CE;, be sufficient to show that |y|< py 




where $\psi_1$ is a function independent of $\delta$, with $E\{\psi_1\} < \infty$, with probability 1. Now


$|\psi| \leq \int_{t_1}^{t_2} \sigma[s, x(s)]^2\,ds + \sum_{j=0}^{n-1} \int_{s_j}^{s_{j+1}} \sigma[s, x(s_j)]^2\,ds$

$\leq K^2 \int_{t_1}^{t_2} [1 + x(s)^2]\,ds + K^2 \sum_{j=0}^{n-1} [1 + x(s_j)^2]\,(s_{j+1} - s_j)$

$\leq 2K^2 (t_2 - t_1) \max_{a \leq t \leq b} [1 + x(t)^2] = \psi_1$

and we have seen above that $E\{\psi_1\} < \infty$. Now suppose that $\sigma(t, \xi)$ never vanishes. Then since (3.21) is true there is, according to IX, Theorem 5.3, a Brownian motion process $\{y(t), a \leq t \leq b\}$ such that

$\tilde{x}(t) - \tilde{x}(a) = \int_a^t \sigma[s, x(s)]\,dy(s),$

so that

$x(t) - x(a) = \int_a^t m[s, x(s)]\,ds + \int_a^t \sigma[s, x(s)]\,dy(s).$

If $\sigma(t, \xi)$ may vanish, the same theorem states that $\tilde{x}(t)$ can be put in this form after a Brownian motion process is adjoined to $\omega$ space. In either case it follows that the $x(t) - x(a)$ process is a Markov process, and the theorem is now completely proved.


CHAPTER VII 


Martingales 


1. Definitions; martingales and semi-martingales 


A martingale was defined in II §7 as a (real or complex) stochastic process $\{x_t, t \in T\}$ for which $E\{|x_t|\} < \infty$, $t \in T$, and

(1.1)  $x_{t_n} = E\{x_{t_{n+1}} \mid x_{t_1}, \cdots, x_{t_n}\}$

with probability 1, whenever $t_1 < \cdots < t_{n+1}$. Here $n$ is an arbitrary positive integer. It follows that, if a process $\{x_t, t \in T\}$ is a martingale, every process $\{x_t, t \in T_1\}$ with $T_1 \subset T$ is also a martingale, and conversely, if the latter process is a martingale for every finite set $T_1 \subset T$, then the process $\{x_t, t \in T\}$ is a martingale.

In the following, if $\{x_t, t \in T\}$ is any stochastic process, and if $\Lambda$ is an $\omega$ set measurable on the sample space of the $x_t$'s (see II §2), we shall attempt to increase the intuitive content of the discussion by describing $\Lambda$ as a set determined by conditions on the $x_t$'s. Thus, if $x_1, \cdots, x_n$ are random variables, an $\omega$ set is a set determined by conditions on these random variables if and only if it is an $\omega$ set of the form $\{[x_1(\omega), \cdots, x_n(\omega)] \in B\}$, where $B$ is a Borel set in $n$ dimensions (real or complex as the case may be), or if it differs from such a set by one of probability 0.

Going back to the definition of conditional expectation, and changing the notation, (1.1) is equivalent to

(1.2)  $\int_\Lambda x_s\,dP = \int_\Lambda x_t\,dP, \quad s < t,$

for every $\omega$ set $\Lambda$ determined by conditions on a finite number of $x_r$'s with $r \leq s$. This equation is then also true for every $\omega$ set determined by conditions on the $x_r$'s with $r \leq s$, since the latter can be approximated arbitrarily closely by the former ones (or apply Supplement, Theorem 2.1). Equation (1.2) is equivalent to

(1.1')  $x_s = E\{x_t \mid x_r, r \leq s\}, \quad s < t,$

where this equation is to hold with probability 1 for each pair $s$, $t$. The equality (1.1') is sometimes used as the defining property of a martingale


§1 DEFINITIONS; MARTINGALES AND SEMI-MARTINGALES 293 


instead of (1.1). The equality (1.2) will be called the martingale equality. If the $x_t$'s are complex, and if (1.2) is true, it is true for the real and imaginary parts separately. Then if an $x_t$ process is a martingale, the processes defined by the real and imaginary parts of the $x_t$'s are also martingales. This fact will be used below to reduce theorems on complex martingales to theorems on real ones.

Example 1 (See also Example 1 of II §7.) Let $z$ be a random variable with $E\{|z|\} < \infty$. Suppose that $T$ is a linear set, and that to each $t \in T$ corresponds a Borel field $\mathcal{F}_t$ of measurable $\omega$ sets, with $\mathcal{F}_s \subset \mathcal{F}_t$ when $s < t$. Define

$x_t = E\{z \mid \mathcal{F}_t\}.$

The process $\{x_t, t \in T\}$ is then a martingale, and even the process obtained by adding another parameter value to $T$, to the right of the given parameter values, and making $z$ correspond to this parameter value, is a martingale. We can prove this statement by using the rules of combination of conditional expectations. Since this method has already been used in the particular case of Example 1 of II §7, however, we prove the result this time by proving the martingale equality directly; that is, we prove that, if $s < t$, and if $\Lambda$ is an $\omega$ set determined by conditions on a finite number of $x_r$'s, with $r \leq s$, then

$\int_\Lambda x_s\,dP = \int_\Lambda x_t\,dP,$

and that moreover this equation remains true for all $s \in T$ if $x_t$ is replaced on the right by $z$. The following proof is applicable with or without this replacement. The $\omega$ set $\Lambda$ is either itself in $\mathcal{F}_s$ or at least differs from such a set by a set of probability 0. Hence by the definition of conditional expectation (or the identification of $x_t$ with $z$) both sides of the above equation are equal to

$\int_\Lambda z\,dP.$

Hence the desired equation is true.

As a particular case of Example 1 we obtain Example 1 of II §7, as follows. Let $z$ be as above, and let $w_1, w_2, \cdots$ be any random variables. Then, if $x_n$ is defined by

$x_n = E\{z \mid w_1, \cdots, w_n\},$

the random variables $x_1, x_2, \cdots, z$ constitute a martingale. In fact if $\mathcal{F}_n$ is defined as the Borel field of $\omega$ sets determined by conditions on $w_1, \cdots, w_n$, the definition of $x_n$ becomes precisely that given above. In practice $\mathcal{F}_t$ is usually determined as in this particular case; it is the Borel


294 MARTINGALES VII 


field of $\omega$ sets determined by conditions on the random variables of a given collection which depends on $t$, and increases with $t$. The general principle we shall use repeatedly is that as more and more conditions are imposed in a conditional expectation, the ordered family of random variables obtained is a martingale.
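The principle can be checked by exact enumeration in a small finite case. The following Python sketch assumes, purely for illustration, that the conditioning variables $w_1, \cdots, w_n$ are independent fair $\pm 1$ coin flips and that $z$ is an arbitrary function of the first `total` flips; the names `cond_exp` and `is_martingale` are hypothetical helpers, not notation from the text.

```python
from fractions import Fraction
from itertools import product


def cond_exp(z, prefix, total):
    """E{z | w_1, ..., w_n} on the event that the first n fair +/-1 flips equal `prefix`.

    z is an integer-valued function of all `total` flips; the value is the
    average of z over the equally likely completions of `prefix`.
    """
    rest = total - len(prefix)
    values = [z(prefix + tail) for tail in product((1, -1), repeat=rest)]
    return Fraction(sum(values), len(values))


def is_martingale(z, total):
    """Check x_n = E{x_{n+1} | w_1, ..., w_n} for x_n = E{z | w_1, ..., w_n}."""
    for n in range(total):
        for prefix in product((1, -1), repeat=n):
            up = cond_exp(z, prefix + (1,), total)
            down = cond_exp(z, prefix + (-1,), total)
            if cond_exp(z, prefix, total) != (up + down) / 2:
                return False
    return True


if __name__ == "__main__":
    print(is_martingale(lambda w: max(0, sum(w)) ** 2, 5))
```

Exact rational arithmetic is used so that the martingale equality is tested as an identity rather than up to rounding error.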

A semi-martingale is a real process $\{x_t, t \in T\}$, defined in the same way as a martingale except that the critical equality is replaced by an inequality; "$=$" is replaced by "$\leq$" in (1.1), (1.2), (1.1'). The inequality

(1.2s)  $\int_\Lambda x_s\,dP \leq \int_\Lambda x_t\,dP, \quad s < t,$

will be called the semi-martingale inequality. The semi-martingale versions of (1.1), (1.2), (1.1') are equivalent, and any one can be used as the defining property of a semi-martingale. It will sometimes be convenient although illogical to refer to processes with (1.2s) true with the inequality reversed as lower semi-martingales.

The partial sums of any series of non-negative random variables with finite expectations constitute a semi-martingale. More interesting examples will be given below.

We now specialize slightly the martingale and semi-martingale definitions. Let $\{x_t, t \in T\}$ be a stochastic process, with $E\{|x_t|\} < \infty$, $t \in T$, and suppose that to each $t \in T$ corresponds a Borel field $\mathcal{F}_t$ of measurable $\omega$ sets such that

(i) $\mathcal{F}_s \subset \mathcal{F}_t$, $s < t$;

(ii) $x_t$ is either measurable with respect to the field $\mathcal{F}_t$ or is equal for almost all $\omega$ to a function that is;

either

(iii) $x_s = E\{x_t \mid \mathcal{F}_s\}$

with probability 1, whenever $s < t$, or

(iiis) the process is real and

$x_s \leq E\{x_t \mid \mathcal{F}_s\}$

with probability 1, whenever $s < t$.

Conditions (iii) and (iiis) imply the martingale equality (1.2) and semi-martingale inequality (1.2s) respectively, if $\Lambda \in \mathcal{F}_s$, or if $\Lambda$ differs by at most an $\omega$ set of probability 0 from a set in $\mathcal{F}_s$, in particular if $\Lambda$ is defined by conditions on the $x_r$'s for $r \leq s$. Then the $x_t$ process is a martingale if (iii) holds, a semi-martingale if (iiis) holds. We shall denote such a martingale or semi-martingale by $\{x_t, \mathcal{F}_t, t \in T\}$ to call attention to the $\mathcal{F}_t$'s, and describe the process as a martingale or semi-martingale relative to the $\mathcal{F}_t$'s. According to this definition, if $\{y_t, t \in T\}$ is any martingale [semi-martingale] and if $\mathcal{Y}_t$ is the Borel field of sets measurable on the sample space of the $y_s$'s with $s \leq t$, then $\{y_t, \mathcal{Y}_t, t \in T\}$ is a martingale [semi-martingale]. Thus every martingale or semi-martingale is one relative to certain Borel fields of sets, and when the fields are not specified we can always take those defined like the $\mathcal{Y}_t$'s in terms of the given random variables.

If $\{x_t, \mathcal{F}_t, t \in T\}$ is a martingale or semi-martingale, and if $\mathcal{F}_t'$ is the Borel field generated by the sets of $\mathcal{F}_t$ and the sets of probability 0, so that $\mathcal{F}_t'$ consists of the sets of $\mathcal{F}_t$ and of the sets which differ from $\mathcal{F}_t$ sets by sets of probability 0, then $\{x_t, \mathcal{F}_t', t \in T\}$ is also a martingale or semi-martingale. In other words, it is no restriction on generality to suppose that the $\mathcal{F}_t$'s contain all sets of probability 0 in the first place. If this is true, the alternative in (ii) above is unnecessary since $x_t$ is itself measurable with respect to $\mathcal{F}_t$ if it is equal almost everywhere to a function which is.

Note that, if $\{x_t', \mathcal{F}_t', t \in T\}$ and $\{x_t'', \mathcal{F}_t'', t \in T\}$ are both martingales [semi-martingales], defined on the same $\omega$ space, then if $\mathcal{F}_t' = \mathcal{F}_t''$ for $t \in T$, the process

$\{x_t' + x_t'', \mathcal{F}_t', t \in T\}$

is also a martingale [semi-martingale]. The process $\{x_t' + x_t'', t \in T\}$ is not necessarily a martingale [semi-martingale] without this identity of the fields involved, however. If $\{x_t, \mathcal{F}_t, t \in T\}$ is a martingale, and if $x_t = u_t + i v_t$ where $u_t$ and $v_t$ are real, the processes

$\{u_t, \mathcal{F}_t, t \in T\}, \quad \{v_t, \mathcal{F}_t, t \in T\}$

are martingales.

THEOREM 1.1 (i) If $\{x_t, \mathcal{F}_t, t \in T\}$ is a semi-martingale, and if $\Phi$ is a real function of the real variable $\lambda$, which is monotone non-decreasing and convex, with $E\{|\Phi(x_{t_0})|\} < \infty$ for some $t_0 \in T$, then $\{\Phi(x_t), \mathcal{F}_t, t \in T, t \leq t_0\}$ is a semi-martingale.

(ii) If $\{x_t, \mathcal{F}_t, t \in T\}$ is a martingale, then $\{|x_t|, \mathcal{F}_t, t \in T\}$ is a semi-martingale.

(iii) If $\{x_t, \mathcal{F}_t, t \in T\}$ is a real martingale, and if $\Phi$ is a real function of the real variable $\lambda$ which is continuous and convex, with $E\{|\Phi(x_{t_0})|\} < \infty$ for some $t_0 \in T$, then $\{\Phi(x_t), \mathcal{F}_t, t \in T, t \leq t_0\}$ is a semi-martingale.

The proofs of these three statements are the same in principle, so only the proof of (i) will be given. In that case, if $t < t_0$ and if $t \in T$, then we have, using Jensen's inequality (see I §9),

(1.3)  $\Phi(x_t) \leq \Phi[E\{x_{t_0} \mid \mathcal{F}_t\}] \leq E\{\Phi(x_{t_0}) \mid \mathcal{F}_t\}$

and (convexity of $\Phi$) there is a positive constant $c$ such that

$c\lambda \leq \Phi(\lambda)$

for sufficiently negative $\lambda$. Then $\Phi(x_t)$ is at most equal to the integrable function on the right in (1.3) when $\Phi(x_t)$ is large, and is at least equal to the integrable function $-c|x_t|$ when $-\Phi(x_t)$ is large. Consequently $\Phi(x_t)$ is itself integrable. Finally, the inequality (1.3) between the extremes must now hold for any pair of parameter values $s$, $s_0$ (instead of $t$, $t_0$), if $s < s_0 \leq t_0$, and this is one condition necessary and sufficient that $\{\Phi(x_t), \mathcal{F}_t, t \in T, t \leq t_0\}$ be a semi-martingale.

The following are the most important applications of this theorem. (We define $\log^+ \lambda$ as usual as 0 if $\lambda < 1$, $\log \lambda$ if $\lambda \geq 1$.)

(a) If the $x_t$ process is a semi-martingale, the processes with random


variables 


are semi-martingales if the relevant expectations exist. 
(b) If the $x_t$ process is a martingale, the processes with random variables

$\{|x_t| \log^+ |x_t|\}, \quad \{|x_t|^\alpha\} \ (\alpha \geq 1)$

are semi-martingales if the relevant expectations exist.
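Theorem 1.1 can be illustrated by exact enumeration on a coin-tossing walk. The sketch below is in Python; it assumes independent $\pm 1$ flips with $P\{+1\} = p$ and takes the Borel fields to be those generated by the flips. The helper name `is_semi_martingale` is hypothetical. It checks the inequality $x_n \leq E\{x_{n+1} \mid \mathcal{F}_n\}$ on every flip history, which is the semi-martingale inequality in the sense used here.

```python
from fractions import Fraction
from itertools import product


def is_semi_martingale(x, n_max, p=Fraction(1, 2)):
    """Check x_n <= E{x_{n+1} | flips 1..n} for all flip histories of length n < n_max.

    x maps a tuple of +/-1 flips to the value of the process after those flips;
    flips are independent with P{+1} = p.
    """
    for n in range(1, n_max):
        for history in product((1, -1), repeat=n):
            expected_next = p * x(history + (1,)) + (1 - p) * x(history + (-1,))
            if x(history) > expected_next:
                return False
    return True


if __name__ == "__main__":
    # The fair walk sum(h) is a martingale; convex functions of it should pass.
    print(is_semi_martingale(lambda h: abs(sum(h)), 7))
    print(is_semi_martingale(lambda h: abs(sum(h)) ** 3, 7))
    # The concave function -|x| reverses the inequality and should fail.
    print(is_semi_martingale(lambda h: -abs(sum(h)), 7))
```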
Now consider any series $\sum_j y_j$ of random variables whose expectations exist. The sequence of partial sums is a martingale if and only if

(1.4)  $E\{y_{n+1} \mid y_1, \cdots, y_n\} = 0, \quad n \geq 1,$

with probability 1, according to II §7. The $y_j$'s are easily modified by subtracting the proper expectations to obtain a new series

$\sum_j \left( y_j - E\{y_j \mid y_1, \cdots, y_{j-1}\} \right)$

whose summands have the property (1.4), so that the new sequence of partial sums is a martingale. Thus, if in any particular case it is possible to make suitable estimates of the subtracted expectations, a general series can be reduced profitably to one satisfying (1.4). We shall find it more convenient to perform this operation on the partial sums. In terms of these the argument runs as follows. Let $x_1, x_2, \cdots$ be any sequence of random variables whose expectations exist, and define $x_n'$, $\Delta_n$ by

(1.5)  $\Delta_1 = 0,$
$x_n = x_n' + \sum_{j=1}^{n} \Delta_j, \quad \Delta_j = E\{x_j \mid x_1, \cdots, x_{j-1}\} - x_{j-1}, \quad j > 1.$

Then the $x_n'$ process is a martingale (because, if $m < n$, $E\{x_n' \mid x_1, \cdots, x_m\} = x_m'$). In any particular application enough must be known of the $\Delta_j$'s to make the reduction of $x_n$ properties to $x_n'$ properties useful. In particular, if the $x_n$ process is a semi-martingale, the $\Delta_j$'s are non-negative.


Conversely, if $x_n$ can be put in the form $x_n' + \sum_{j=1}^{n} \Delta_j$, where the $x_n'$ process is a martingale and where more particularly

$E\{x_n' \mid x_1, \cdots, x_m\} = x_m', \quad m < n,$

with probability 1, and where the $\Delta_j$'s are non-negative, then the $x_n$ process is a semi-martingale. This characterization of a semi-martingale in terms of a martingale will play an important role below.
Example 2 Let $y_1, y_2, \cdots$ be mutually independent random variables whose expectations exist, and let $x_n = \sum_{j=1}^{n} y_j$. Then the $x_n$ process is a martingale if and only if $E\{y_j\} = 0$ for $j > 1$, a semi-martingale if and only if the $y_j$'s are real and $E\{y_j\} \geq 0$ for $j > 1$. In the latter case the representation

$x_n = y_1 + \sum_{j=2}^{n} [y_j - E\{y_j\}] + \sum_{j=2}^{n} E\{y_j\}$

is a simple special case of (1.5) in which the $\Delta_j$'s are non-negative constants.
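If $\{x_n, \mathcal{F}_n, n \geq 1\}$ is a semi-martingale, (1.5) can be generalized slightly to become

(1.5')  $\Delta_1 = 0,$
$x_n = x_n' + \sum_{j=1}^{n} \Delta_j, \quad \Delta_j = E\{x_j \mid \mathcal{F}_{j-1}\} - x_{j-1}, \quad j > 1.$

In this representation $\{x_n', \mathcal{F}_n, n \geq 1\}$ is a martingale, $\Delta_j \geq 0$, and $\Delta_j$ is measurable with respect to $\mathcal{F}_{j-1}$.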
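The representation (1.5') can be computed explicitly in a small case. The following Python sketch assumes, for illustration only, that $\mathcal{F}_j$ is generated by the first $j$ of a sequence of independent $\pm 1$ flips with $P\{+1\} = p$, and that $x_j$ is a function of those flips; `decompose` is a hypothetical helper name.

```python
from fractions import Fraction


def decompose(x, path, p=Fraction(1, 2)):
    """Return (x'_n, [Delta_1, ..., Delta_n]) along one flip history, as in (1.5').

    x    : maps a tuple of +/-1 flips to the value of the process after those flips
    path : the observed flips (w_1, ..., w_n), n >= 1
    p    : P{flip = +1}; F_j is taken to be generated by the first j flips
    """
    deltas = [Fraction(0)]  # Delta_1 = 0
    for j in range(2, len(path) + 1):
        past = path[: j - 1]
        cond_mean = p * x(past + (1,)) + (1 - p) * x(past + (-1,))  # E{x_j | F_{j-1}}
        deltas.append(cond_mean - x(past))  # Delta_j = E{x_j | F_{j-1}} - x_{j-1}
    return x(path) - sum(deltas), deltas


if __name__ == "__main__":
    # Square of a fair walk: every Delta_j with j > 1 equals 1.
    print(decompose(lambda w: sum(w) ** 2, (1, 1, -1)))
    # Walk with upward drift: Delta_j = E{y_j} = 1/3, as in Example 2.
    print(decompose(lambda w: sum(w), (1, -1, 1), Fraction(2, 3)))
```

In both examples the $\Delta_j$ are non-negative, in agreement with the fact that the processes are semi-martingales.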
In the following we shall say that a process $\{x_t, t \in T\}$ is dominated by a semi-martingale $\{x_t^+, t \in T\}$ if

$P\{|x_t(\omega)| \leq x_t^+(\omega)\} = 1, \quad t \in T.$

THEOREM 1.2 Let $\{x_n, \mathcal{F}_n, n \geq 1\}$ be a semi-martingale, and consider the representation (1.5').

(i) If L.U.B.$_n$ $E\{x_n\} < \infty$, then $\sum_{j=1}^{\infty} E\{\Delta_j\} < \infty$, and $\sum_{j=1}^{\infty} \Delta_j < \infty$ with probability 1.

(ii) If L.U.B.$_n$ $E\{|x_n|\} < \infty$, then [in addition to the conclusions of (i)], L.U.B.$_n$ $E\{|x_n'|\} < \infty$.

(iii) If the $x_n$'s are uniformly integrable, then [in addition to the conclusions of (i) and (ii)] the $x_n'$'s are uniformly integrable.

(iv) The semi-martingale $\{x_n, \mathcal{F}_n, n \geq 1\}$ is dominated by a semi-martingale, and in fact we can set

(1.6)  $x_n^+ = |x_n'| + \sum_{j=1}^{n} \Delta_j.$

Proof of (i) Using the martingale equality, we find

$E\{x_n'\} = E\{x_1'\} = E\{x_1\},$

so that

(1.7)  $E\{x_n\} = E\{x_1\} + \sum_{j=1}^{n} E\{\Delta_j\}.$

Then under the hypothesis of (i) the partial sums on the right are bounded when $n \to \infty$, so that the conclusion of (i) follows at once. We observe that according to the semi-martingale inequality [or (1.7)] the left side of (1.7) is monotone in $n$, so that the hypothesis of (i) is simply that $\lim_{n \to \infty} E\{x_n\} < \infty$.

Proof of (ii) If the hypothesis of (i) is strengthened to that of (ii), the left side of the inequality

$E\{|x_n'|\} \leq E\{|x_n|\} + \sum_{j=1}^{n} E\{\Delta_j\}$

must be bounded in $n$.

Proof of (iii) If the hypothesis of (ii) is strengthened to that of (iii), the left side of the inequality

$|x_n'| \leq |x_n| + \sum_{j=1}^{\infty} \Delta_j$

must be uniformly integrable.

The truth of (iv) is obvious.

In this chapter the parameter values of the processes will not necessarily range over intervals or over sequences of consecutive integers. The properties (1.1') and the corresponding inequality for semi-martingales are obviously meaningful as long as the range of values of $t$ is an ordered set. For example, we shall sometimes discuss an ordered family of random variables of the type

$x_1, x_2, \cdots, x_\infty$ or $x_{-\infty}, \cdots, x_{-1}, x_0$.

We shall let the parameter set $T$ be any set of the infinite line $[-\infty, \infty]$, closed at the ends by the addition of the points $\pm\infty$, and topological concepts will be used accordingly. For example, the two parameter sets listed above are closed sets in this topology, but the set of all integers is not closed, because $\pm\infty$ are limit points which are not in the set.


§2 APPLICATION TO GAMES OF CHANCE 299 


2. Application to games of chance 

Suppose a gambler with fortune $x_1$ plays some game of chance once, and that his fortune after the play is $x_2$. Then $x_2$ is a random variable, and the game is usually considered "fair" if

$E\{x_2\} = x_1.$

This definition of fair is of course somewhat arbitrary, although hallowed by tradition. If the gambler then plays the same or some other game, the preceding criterion of fairness becomes

$E\{x_3 \mid x_2\} = x_2$

(with probability 1), where $x_3$ is his fortune after the second play, and where his choice of the second game may depend on $x_2(\omega)$. Continuing in this way, it becomes clear that one natural definition of fairness is the martingale condition

$E\{x_{n+1} \mid x_1, \cdots, x_n\} = x_n, \quad n = 1, 2, \cdots$

(with probability 1), where $x_n$ is the gambler's fortune after the $(n-1)$th play. (Since $x_1 = \text{const.}$ it makes no difference whether $x_1$ is present among or absent from the conditioning variables. This irrelevancy of $x_1$ will not be a fact below, however.)

A closer analysis of the ideas involved in this concept of fairness suggests that the first play should only be considered fair if

$E\{x_2 \mid y_1\} = x_1$

with probability 1, where $y_1$ indicates one or more random variables representing past history, as known to the gambler, up to the time of the play. At the next play the criterion of fairness becomes

$E\{x_3 \mid y_2\} = x_2$

with probability 1, where $y_2$ represents the past history known to the gambler, including, for example, the value of $x_2$, up to the time of the second play, and so on. These fairness conditions are slightly stronger than the first ones given, and suggest the following mathematical model of a fair game: a fair game is a martingale $\{x_n, \mathcal{F}_n, n \geq 1\}$ relative to some stated Borel fields. The Borel field $\mathcal{F}_n$ represents the influence of the past up to and including time $n$. In the same spirit we define: a favorable game is a semi-martingale $\{x_n, \mathcal{F}_n, n \geq 1\}$ relative to some stated Borel fields. We do not make the hypothesis that $x_1$ is identically constant, with probability 1, although this is suggested by the interpretation we have given.

If the concepts of fairness and of advantageousness are to be self-consistent, fair and advantageous games must remain so when the gambler (or house) adopts certain acceptable practices which have the effect of changing the given process. This fact suggests various theorems which are the subject of the present section, and which have important theoretical applications.

Suppose, for example, that the gambler decides to leave, instead of 
playing indefinitely, either because he thinks that he has won (or lost) 
enough, or because he is discouraged by the way the game has been going, 
or for any other reason. Then the game is still fair (or advantageous) if 
it was so originally, unless the gambler has quit because he can foresee the 
future, and knows, for example, that the next plays will go against him 
(or because he knows that the next plays will go in his favor, and he 
chivalrously refuses to take advantage of his extrasensory powers). The 
mathematical formulation of this conduct of the gambler is the following. 
Let $m$ be a random variable which may take on the value $+\infty$ with positive probability but whose finite values are non-negative integers. (The gambler stops after making $m$ plays.) Define the random variables $\tilde{x}_1, \tilde{x}_2, \cdots$ by

$\tilde{x}_j(\omega) = \begin{cases} x_j(\omega), & j \leq m(\omega) \\ x_{m(\omega)}(\omega), & j > m(\omega). \end{cases}$

It is supposed that the condition $m(\omega) = j$ is a condition only on the past up to the $j$th play; more precisely, it is supposed that

$\{m(\omega) = j\} \in \mathcal{F}_j.$

The transformation from $\{x_n, \mathcal{F}_n, n \geq 1\}$ to $\{\tilde{x}_n, n \geq 1\}$ will be said to be a transformation under a system of optional stopping. It is natural to expect that under this transformation a fair game remains fair and an advantageous one remains advantageous, that is, that a martingale goes into a martingale and a semi-martingale into a semi-martingale. These invariance properties are the subject of Theorem 2.1. The extensive use of this theorem in §4 to prove convergence properties of martingales shows that it is more profitable to gamble on the convergence of a sequence than on the color of a card. Although Theorem 2.1 is a very special case of Theorem 2.2, we prove it separately here to simplify the reading of §4.

THEOREM 2.1 Suppose that the semi-martingale [martingale] $\{x_n, \mathcal{F}_n, n \geq 1\}$ is transformed into the process $\{\tilde{x}_n, n \geq 1\}$ under optional stopping. Then the $\tilde{x}_n$ process is a semi-martingale [martingale]. In the semi-martingale case,

$E\{x_1\} \leq E\{\tilde{x}_n\} \leq E\{x_n\}, \quad n \geq 1;$

in the martingale case,

$E\{\tilde{x}_n\} = E\{x_1\}, \quad n \geq 1.$


Since $\tilde{x}_n$ is equal (in pieces) to $x_1, \cdots, x_n$, it follows that

$E\{|\tilde{x}_n|\} \leq \sum_{j=1}^{n} E\{|x_j|\} < \infty.$

Suppose that the $x_n$ process is a semi-martingale. To prove that the $\tilde{x}_n$ process is a semi-martingale, we must prove that, if $\Lambda$ is any $\omega$ set determined by conditions on $\tilde{x}_1, \cdots, \tilde{x}_n$, then

(2.1)  $\int_\Lambda \tilde{x}_n\,dP \leq \int_\Lambda \tilde{x}_{n+1}\,dP.$

If $m$ is the function defining the optional stopping, as explained in the definition of this concept, $\tilde{x}_j(\omega) = x_{m(\omega)}(\omega)$ if $j \geq m(\omega)$. Hence

(2.2)  $\int_{\Lambda\{m(\omega) \leq n\}} \tilde{x}_n\,dP = \int_{\Lambda\{m(\omega) \leq n\}} \tilde{x}_{n+1}\,dP$

because the integrands are identical. Now $\Lambda$ is defined in terms of conditions on $\tilde{x}_1, \cdots, \tilde{x}_n$, and the latter random variables are defined in terms of $x_1, \cdots, x_n$ in such a way that it is trivial to verify that $\Lambda$ is also defined by conditions on $x_1, \cdots, x_n$. The $\omega$ set $\{m(\omega) \leq n\}$, and therefore its complement $\{m(\omega) > n\}$, are also defined by conditions on $x_1, \cdots, x_n$, and finally this means that $\Lambda\{m(\omega) > n\}$ is so defined. But then, using the semi-martingale property,

(2.3)  $\int_{\Lambda\{m(\omega) > n\}} x_n\,dP \leq \int_{\Lambda\{m(\omega) > n\}} x_{n+1}\,dP,$

and we can replace $x_{n+1}$ by $\tilde{x}_{n+1}$ and $x_n$ by $\tilde{x}_n$ in the integrands, since

$\tilde{x}_{n+1}(\omega) = x_{n+1}(\omega), \quad \tilde{x}_n(\omega) = x_n(\omega), \quad \text{if } m(\omega) > n.$

Adding (2.2) and (2.3), we find that (2.1) is true, that is, the $\tilde{x}_n$ process is a semi-martingale. If the $x_n$ process is a martingale, there is equality in (2.3) and therefore in (2.1), so that the $\tilde{x}_n$ process is also a martingale. In both cases, $x_1(\omega) = \tilde{x}_1(\omega)$. Hence in the martingale case

$E\{x_1\} = E\{\tilde{x}_1\} = E\{\tilde{x}_n\}, \quad n \geq 1.$

In the semi-martingale case we find, using the semi-martingale property,

$E\{x_1\} = E\{\tilde{x}_1\} \leq E\{\tilde{x}_n\} = \sum_{j=1}^{n-1} \int_{\{m(\omega) = j\}} x_j\,dP + \int_{\{m(\omega) \geq n\}} x_n\,dP \leq E\{x_n\}.$
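The conclusions of Theorem 2.1 can be verified by exact enumeration for a coin-tossing game. The Python sketch below assumes, for illustration only, that $x_j = w_1 + \cdots + w_j$ with independent flips $w_i = \pm 1$, $P\{w_i = +1\} = p$, and that the gambler's rule for stopping looks only at the current index and fortune; the names `stopped_fortune`, `expectation` and `quit_when_ahead` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def stopped_fortune(flips, n, stop_rule):
    """x~_n for x_j = w_1 + ... + w_j: equals x_n if n <= m, and x_m if n > m.

    stop_rule(j, x_j) -> bool decides from the present index and fortune whether
    to stop, so {m = j} is a condition on the first j plays only.
    """
    fortune = 0
    for j in range(1, n + 1):
        fortune += flips[j - 1]
        if stop_rule(j, fortune):
            break  # m = j <= n: the fortune is frozen at x_m
    return fortune


def expectation(n, p, value):
    """Exact E{value(w_1, ..., w_n)} for independent flips with P{+1} = p."""
    total = Fraction(0)
    for flips in product((1, -1), repeat=n):
        prob = Fraction(1)
        for w in flips:
            prob *= p if w == 1 else 1 - p
        total += prob * value(flips)
    return total


def quit_when_ahead(j, fortune):
    """Hypothetical stopping rule: leave the game as soon as the fortune is positive."""
    return fortune >= 1


if __name__ == "__main__":
    n = 6
    for p in (Fraction(1, 2), Fraction(2, 3)):
        first = expectation(1, p, lambda w: w[0])
        stopped = expectation(n, p, lambda w: stopped_fortune(w, n, quit_when_ahead))
        unstopped = expectation(n, p, sum)
        print(p, first, stopped, unstopped)
```

For the fair game the stopped expectation agrees with $E\{x_1\}$; for the favorable game it lies between $E\{x_1\}$ and $E\{x_n\}$.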
We now generalize the concept of optional stopping to that of optional sampling. Optional sampling transforms the process $\{x_n, \mathcal{F}_n, n \geq 1\}$ into a process $\{\tilde{x}_n, n \geq 1\}$ as follows. Let $m_1, m_2, \cdots$ be a finite or infinite sequence of integral-valued random variables with the properties

$1 \leq m_1 \leq m_2 \leq \cdots < \infty,$
$\{m_j(\omega) = k\} \in \mathcal{F}_k,$

the inequality to hold with probability 1. Define

$\tilde{x}_j(\omega) = x_{m_j(\omega)}(\omega), \quad j \geq 1.$

In gambling terminology, the change from $x_n$ to $\tilde{x}_n$ corresponds to having the gambler sample his fortune only at certain times dependent on the past and present. In particular, if $m$ is an integral-valued random variable determining optional stopping, as explained above, and if $m_j$ is defined by

$m_j(\omega) = \text{Min}\,[m(\omega), j],$

the $m_j$'s satisfy the conditions imposed on an optional sampling $m_j$ sequence, and the optional sampling determined by these $m_j$'s yields the same $\tilde{x}_n$ process as the optional stopping determined by $m$.

THEOREM 2.2 Suppose that the semi-martingale $\{x_n, \mathcal{F}_n, n \geq 1\}$ is transformed into the process $\{\tilde{x}_n, n \geq 1\}$ by optional sampling. Then, if

(2.4)  $E\{|\tilde{x}_n|\} < \infty, \quad n \geq 1,$

and if
(2.5) lim inf j tydP=0, n=l, 


e D 
it follows that the $\tilde{x}_n$ process is also a semi-martingale, with

(2.6)  $E\{x_1\} \leq E\{\tilde{x}_n\}, \quad n \geq 1.$

Condition (2.4) is always satisfied if L.U.B.$_n$ $E\{|x_n|\} < \infty$.

Under (2.4) and

(2.5')  $\liminf_{N \to \infty} \int_{\{m_n(\omega) > N\}} |x_N|\,dP = 0, \quad n \geq 1,$

(2.6) can be strengthened to

(2.6')  $E\{x_1\} \leq E\{\tilde{x}_n\} \leq$ L.U.B.$_j$ $E\{x_j\}, \quad n \geq 1,$

and if the $x_n$ process is a martingale, the $\tilde{x}_n$ process is also a martingale, and there is then equality in (2.6').




Each of the following conditions $C_1$-$C_4$ implies the validity of (2.4) and (2.5').

$C_1$ The $x_n$'s are uniformly integrable.

$C_2$ Each $m_j$ is a bounded random variable (with probability 1). (This condition is always satisfied in the case of optional stopping.)

C, There is a constant K such that, for each j, 


Efm} < ©, 
Elfen a AE for n< mo), 
with probability 1. 

C, The x, process is a martingale, (2.4) is true, and there is a random 
variable z > 0, with E{z} < 00, and a sequence jy <j, <* ` ` of integers, 
such that 

EAI => EZA — 7, k>2ji2z L, 
with probability 1. If |æ] is replaced by x;, here, the thus weakened 
condition implies the validity of (2.4) and (2.5) even if the x, process is only 


a semi-martingale. 
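A small exact computation illustrates the martingale case of Theorem 2.2 when the sampling times are bounded, so that condition $C_2$ holds. The Python sketch assumes a fair coin-tossing walk $x_j = w_1 + \cdots + w_j$ and two hypothetical sampling times: $m_1$, the first visit to $+1$ capped at 3, and $m_2$, the first visit to $+2$ capped at the horizon. Both are determined by the flips seen so far, and $m_1 \leq m_2$ because the walk must pass $+1$ before reaching $+2$.

```python
from fractions import Fraction
from itertools import product


def first_passage(flips, level, cap):
    """m(w) = min(first j with w_1 + ... + w_j = level, cap): a bounded sampling time."""
    fortune = 0
    for j in range(1, cap + 1):
        fortune += flips[j - 1]
        if fortune == level:
            return j
    return cap


def sampled_means(horizon=6):
    """Exact E{x~_1}, E{x~_2} for a fair walk sampled at m_1 <= m_2 <= horizon."""
    means = [Fraction(0), Fraction(0)]
    weight = Fraction(1, 2 ** horizon)
    for flips in product((1, -1), repeat=horizon):
        m1 = first_passage(flips, 1, 3)        # first visit to +1, capped at 3
        m2 = first_passage(flips, 2, horizon)  # first visit to +2, capped at horizon
        assert 1 <= m1 <= m2
        means[0] += weight * sum(flips[:m1])
        means[1] += weight * sum(flips[:m2])
    return means


if __name__ == "__main__":
    print(sampled_means())
```

Both sampled expectations equal $E\{x_1\} = 0$, which is the equality asserted in (2.6') for martingales.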
The proof will be carried through in several steps.

(a) Proof of (2.4) if L.U.B.$_n$ $E\{|x_n|\} < \infty$ Fix $n$, and suppose that the optional stopping transformation defined by $m_n$ takes $\{x_k, k \geq 1\}$ into $\{z_k, k \geq 1\}$, so that

$z_k(\omega) = \begin{cases} x_k(\omega), & k \leq m_n(\omega) \\ x_{m_n(\omega)}(\omega), & k > m_n(\omega). \end{cases}$

Then $\lim_{k \to \infty} z_k = x_{m_n} = \tilde{x}_n$ with probability 1, and according to Theorem 2.1 the process $\{z_k, k \geq 1\}$ is a semi-martingale. In view of Fatou's theorem, it now suffices to prove that L.U.B.$_k$ $E\{|z_k|\} < \infty$. Now


Efla} =2 f dP- Ef) 
tex) >0) 
kl 
<2 [) dP+2 > | %aP—Ef), 
j=l fa(a)> 
Grey Tet 
and, using the facts that = tı, and that {a,, j > 1} is a semi-martingale, 
kel 
Ela<2 | zetz > | dP- Efa) 


j=1 4 
>0 zo) >0 
OOE mo) 


< 2E{|x,|}— Elm}. 




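(b) Proof under (2.4) and (2.5) of the semi-martingale inequality for the $\tilde{x}_n$ process We prove that, if $\Lambda$ is an $\omega$ set determined by conditions on $\tilde{x}_1, \cdots, \tilde{x}_n$, if the $x_n$ process is a semi-martingale, and if (2.4) and (2.5) are true, then

(2.7)  $\int_\Lambda \tilde{x}_n\,dP \leq \int_\Lambda \tilde{x}_{n+1}\,dP,$

and in fact we shall prove the stronger inequality that, if

$\Lambda_j = \Lambda\{m_n(\omega) = j\},$

then

(2.8)  $\int_{\Lambda_j} \tilde{x}_n\,dP \leq \int_{\Lambda_j} \tilde{x}_{n+1}\,dP.$

Define

$\Lambda_{jk} = \Lambda\{m_n(\omega) = j, m_{n+1}(\omega) = k\}, \quad k \geq j,$
$\Lambda_{jk}' = \Lambda\{m_n(\omega) = j, m_{n+1}(\omega) > k\} = \Lambda_j - \bigcup_{i=j}^{k} \Lambda_{ji}, \quad k \geq j.$

Then

$\Lambda_j \in \mathcal{F}_j, \quad \Lambda_{jk} \in \mathcal{F}_k, \quad \Lambda_{jk}' \in \mathcal{F}_k,$
$\Lambda_j = \bigcup_{k \geq j} \Lambda_{jk}.$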
kaj 
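Hence, using the semi-martingale inequality,

(2.9)  $\int_{\Lambda_j} \tilde{x}_n\,dP = \int_{\Lambda_j} x_j\,dP = \int_{\Lambda_{jj}} x_j\,dP + \int_{\Lambda_{jj}'} x_j\,dP$

$\leq \int_{\Lambda_{jj}} x_j\,dP + \int_{\Lambda_{jj}'} x_{j+1}\,dP$

$= \int_{\Lambda_{jj}} x_j\,dP + \int_{\Lambda_{j,j+1}} x_{j+1}\,dP + \int_{\Lambda_{j,j+1}'} x_{j+1}\,dP$

$\leq \int_{\Lambda_{jj}} x_j\,dP + \int_{\Lambda_{j,j+1}} x_{j+1}\,dP + \int_{\Lambda_{j,j+1}'} x_{j+2}\,dP$

$\leq \cdots \leq \sum_{k=j}^{N} \int_{\Lambda_{jk}} x_k\,dP + \int_{\Lambda_{jN}'} x_N\,dP$

$= \int_{\bigcup_{k=j}^{N} \Lambda_{jk}} \tilde{x}_{n+1}\,dP + \int_{\Lambda_{jN}'} x_N\,dP.$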


§2 APPLICATION TO GAMES OF CHANCE 305 


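When $N \to \infty$, the first term in the last line becomes

\[ \int_{\Lambda_j} \tilde{x}_{n+1}\,dP. \]

Hence, in view of (2.5),

\[ \int_{\Lambda_j} \tilde{x}_n\,dP \le \int_{\Lambda_j} \tilde{x}_{n+1}\,dP + \liminf_{N\to\infty} \int_{\Lambda_{jN}'} x_N\,dP \le \int_{\Lambda_j} \tilde{x}_{n+1}\,dP, \]

and we have thus derived the desired inequality (2.8). In particular, if the $x_n$ process is a martingale, (2.9) becomes an equality,

\[ \int_{\Lambda_j} \tilde{x}_n\,dP = \int_{\bigcup_{k=j}^{N} \Lambda_{jk}} \tilde{x}_{n+1}\,dP + \int_{\Lambda_{jN}'} x_N\,dP. \]

When $N \to \infty$ in this equation, under the assumption that (2.5') is true, we obtain (2.8), and therefore also (2.7), with equality. Hence in this case the $\tilde{x}_n$ process is a martingale.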

(c) Proof of (2.6) under (2.4) and (2.5) Suppose first that $m_1(\omega) = 1$. Then, if the $x_n$ process is a semi-martingale, and if the semi-martingale property is invariant under a given system of optional sampling, the $\tilde{x}_n$ process will also be a semi-martingale, so that

\[ E\{\tilde{x}_1\} \le E\{\tilde{x}_n\}, \qquad n \ge 1, \]

and since $x_1(\omega) = \tilde{x}_1(\omega)$ we have obtained (2.6). In the martingale case this reasoning yields equality rather than inequality. The proof has supposed that the first random variables of the $x_n$ and $\tilde{x}_n$ processes are identical. The following argument shows that this supposition is actually no restriction. In fact, define


z(o) = Efe}, F= Fy mw) = 0. 


The augmented process $\{x_n, \mathscr{F}_n, n \ge 0\}$ is a semi-martingale or martingale if and only if the given process is; the augmented system of optional sampling takes $x_0$ into $\tilde{x}_0$ but is otherwise the same as before; (2.4) and (2.5) or (2.5') are valid for the augmented process and optional sampling if they were valid before; finally $E\{x_0\} = E\{x_1\}$. Hence the truth of (2.6) for the augmented process and optional sampling, already proved, implies its truth for the original process and optional sampling. This reasoning shows that, if the $x_n$ process is a martingale, and if (2.4) and (2.5') are true, (2.6) becomes an equality. Finally we prove the right-hand half of (2.6) under (2.4) and (2.5'). The inequality is the same as the left-hand half, and there is equality throughout, if the $x_n$ process is a martingale. In the semi-martingale case, we note that, using the semi-martingale inequality,

\[ \begin{aligned} E\{\tilde{x}_n\} &= \int_{\{m_n(\omega) \le N\}} \tilde{x}_n\,dP + \int_{\{m_n(\omega) > N\}} \tilde{x}_n\,dP \\ &= \sum_{j=1}^{N} \int_{\{m_n(\omega) = j\}} x_j\,dP + \int_{\{m_n(\omega) > N\}} \tilde{x}_n\,dP \\ &\le \int_{\{m_n(\omega) \le N\}} x_N\,dP + \int_{\{m_n(\omega) > N\}} \tilde{x}_n\,dP \\ &\le E\{x_N\} + \int_{\{m_n(\omega) > N\}} |x_N|\,dP + \int_{\{m_n(\omega) > N\}} \tilde{x}_n\,dP. \end{aligned} \]

When $N \to \infty$ this inequality yields the desired inequality, in view of (2.4) and (2.5').
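The equality case of (2.6) for a martingale can be checked by exact enumeration in a toy case. The sketch below is only an illustration and is not part of the argument: it assumes a symmetric coin-tossing walk $x_k = y_1 + \cdots + y_k$ with $y_i = \pm 1$, a horizon `N = 6`, and a hypothetical bounded stopping rule (stop the first time $x_k \ge 2$, or at $k = N$), so that condition $C_2$ holds. The names `walk` and `stopped_value` are made up for this example.

```python
from fractions import Fraction
from itertools import product

N = 6  # horizon; the stopping time below is bounded by N (condition C_2)

def walk(steps):
    """Partial sums x_1, ..., x_N of a sequence of +1/-1 steps."""
    total, path = 0, []
    for s in steps:
        total += s
        path.append(total)
    return path

def stopped_value(path, level=2):
    """Value of x at m = first k with x_k >= level, or at k = N if never reached."""
    for x in path:
        if x >= level:
            return x
    return path[-1]

# Every path of N fair coin tosses has probability 2**-N.
paths = [walk(steps) for steps in product((1, -1), repeat=N)]
weight = Fraction(1, len(paths))

mean_x1 = sum(weight * p[0] for p in paths)
mean_stopped = sum(weight * stopped_value(p) for p in paths)
print(mean_x1, mean_stopped)
```

Both expectations are computed with exact rational arithmetic, so the comparison between $E\{x_1\}$ and the expectation of the stopped variable involves no sampling error.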

(d) Each one of the conditions $C_1$-$C_4$ implies (2.4) The hypothesis of $C_1$ implies that $\text{L.U.B.}_n\, E\{|x_n|\} < \infty$, and thereby, according to part (a) of this proof, that (2.4) is true. Alternatively the truth of (2.4) can also easily be derived from the fact that, according to Theorem 4.1s (ii), there is a semi-martingale $\{x_n', \mathscr{F}_n, 1 \le n \le \infty\}$ which dominates the $x_n$ process. (It is easy to show that under $C_1$ the $\tilde{x}_n$'s are even uniformly integrable.) Under $C_2$, if $m_n$ is bounded by the integer $N$, with probability 1,

\[ E\{|\tilde{x}_n|\} \le \sum_{j=1}^{N} E\{|x_j|\} < \infty, \]

since $\tilde{x}_n$ is equal to $x_1, \cdots, x_N$ in pieces. Under $C_3$ define
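\[ y_1 = |x_1|, \qquad y_j = |x_j - x_{j-1}|, \quad j > 1. \]

Then the $y_j$'s are non-negative random variables, and

\[ \begin{aligned} E\{y_k \mid \mathscr{F}_{k-1}\} &\le K, & k &\le m_j(\omega), \quad j \ge 1, \\ |x_n(\omega)| &\le y_1(\omega) + \cdots + y_n(\omega), & n &\le m_j(\omega), \quad j \ge 1, \end{aligned} \]

with probability 1. In the following proof we shall use only the existence of a set of $y_j$'s with these properties, and $C_3$ could therefore have been stated slightly more generally. Define

\[ z_n = \sum_{j=1}^{n} y_j, \qquad \tilde{z}_n = \sum_{j=1}^{m_n} y_j. \]

Then the $z_n$ process dominates the $x_n$ process, and

\[ E\{\tilde{z}_n\} = \sum_{k=1}^{\infty} \int_{\{m_n(\omega) = k\}} \tilde{z}_n\,dP = \sum_{k=1}^{\infty} \int_{\{m_n(\omega) = k\}} z_k\,dP = \sum_{k=1}^{\infty} \int_{\{m_n(\omega) \ge k\}} y_k\,dP. \]

Since the $\omega$ set $\{m_n(\omega) \ge k\}$, as the complement of $\{m_n(\omega) < k\}$, is an $\omega$ set in the field $\mathscr{F}_{k-1}$, the last sum can be written in the form

\[ \sum_{k=1}^{\infty} \int_{\{m_n(\omega) \ge k\}} E\{y_k \mid \mathscr{F}_{k-1}\}\,dP \le K \sum_{k=1}^{\infty} P\{m_n(\omega) \ge k\} = K E\{m_n\}. \]

Hence $E\{\tilde{z}_n\}$ and $E\{\tilde{x}_n\}$ exist, and in fact we have obtained the inequality

\[ E\{|\tilde{x}_n|\} \le E\{\tilde{z}_n\} \le K E\{m_n\}. \]

Finally, under $C_4$, (2.4) is true by hypothesis.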
(e) Each one of the conditions $C_1$-$C_4$ implies (2.5') Since

\[ \lim_{N\to\infty} P\{m_n(\omega) > N\} = 0, \]

(2.5') is true under $C_1$ (uniform integrability of the $x_n$'s). Under $C_2$, $P\{m_n(\omega) > N\} = 0$ for sufficiently large $N$, so the integral in (2.5') vanishes for sufficiently large $N$. Under $C_3$, using the notation introduced in (d),

\[ \int_{\{m_n(\omega) > N\}} |x_N|\,dP \le \int_{\{m_n(\omega) > N\}} \tilde{z}_n\,dP, \]

and the integral on the right goes to 0 when $N \to \infty$ because, as we have just proved, $E\{\tilde{z}_n\} < \infty$. Under the weak form of $C_4$, if $j_k > N$, and if the $x_n$ process is a semi-martingale,


epdp £ [ tyadP< f ayudPt+ J enp dP 
N wN+1 (o)> N+. 
{Peo Eee PROEL) ra) >0 


ae) 


iy 
< a, dP+ f «dP 
k=N+1 =k nh) 
Sen moO 
A 2 
llar > f ded +aaP 
k=N+1 (m (0) =k} K-54 (m,(o)=k) 
< f (&|+2aP. 
{n,(o)>N} 




Since the last term goes to 0 when $N \to \infty$, (2.5) is true. If the $x_n$ process is a martingale, then the validity of $C_4$ for this process implies that of the weak form of $C_4$ for the $|x_n|$ process, which is a semi-martingale. Hence, by what we have just proved, (2.5) is true for the $|x_n|$ process, that is, (2.5') is true for the $x_n$ process.

The following example, which has interesting implications in the study of random walks, illustrates the possibility of applying Theorem 2.2 even when there is only a single $m_j$.

Example 1 Let $\{x_n, n \ge 1\}$ be a real martingale, and suppose that

\[ P\{\text{L.U.B.}_{n \ge 1}\, x_n(\omega) > E\{x_1\}\} = 1. \]

Define $m(\omega)$ as the first subscript $j$ for which $x_j(\omega) > E\{x_1\}$. Then, if Theorem 2.2 is applicable, the $\tilde{x}_n$ process, where $n$ only takes on the value 1, and $\tilde{x}_1 = x_m$, is a martingale, with

\[ E\{x_1\} = E\{\tilde{x}_1\}. \]

In the present case the martingale property of the $\tilde{x}_n$ process is vacuously satisfied but the preceding equality between expectations is obviously false. Hence the hypotheses of Theorem 2.2 cannot be fulfilled, and their failure gives rise to various theorems. For example (see condition $C_3$), suppose that $y_1, y_2, \cdots$ are mutually independent random variables, with a common distribution function, and that

\[ E\{y_1\} = 0, \qquad E\{|y_1|\} > 0, \qquad x_n = \sum_{j=1}^{n} y_j. \]

Then the $x_n$ process is a martingale, and

\[ E\{x_1\} = 0, \qquad E\{|x_{n+1} - x_n| \mid x_1, \cdots, x_n\} = E\{|y_{n+1}|\} = E\{|y_1|\}, \]

with probability 1. Thus, if we take $\mathscr{F}_n$ as the field of the $\omega$ sets which are determined by conditions on $x_1, \cdots, x_n$, that is, by conditions on $y_1, \cdots, y_n$, the second half of condition $C_3$ is satisfied, with $K = E\{|y_1|\}$, so that the first half cannot be satisfied. It follows that, under the stated hypotheses on the $y_j$'s, the condition

\[ P\{\text{L.U.B.}_{n \ge 1}\, x_n(\omega) > 0\} = 1, \]

which is known always to be satisfied (see Chung and Fuchs [1], 1951), implies that

\[ E\{m\} = \infty. \]

In other words, one is certain to reach the positive half of the coordinate axis in a random walk described by the $x_n$ process, but the expected time to reach it is infinite.
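A small simulation makes the two halves of this conclusion visible. The sketch below is an illustration under stated assumptions, not a proof: it takes the special case $y_j = \pm 1$ with probability 1/2 each, cuts every walk off after a hypothetical cap of `CAP` steps, and uses a fixed seed; the function names are invented for the example. Almost every simulated walk reaches the positive half-axis, while the sample mean of $\min(m, c)$ keeps growing as the cutoff $c$ is raised, as one would expect when $E\{m\} = \infty$.

```python
import random

def first_passage_time(rng, cap):
    """Steps until a symmetric +1/-1 walk first exceeds 0, or None if not within cap steps."""
    position = 0
    for step in range(1, cap + 1):
        position += 1 if rng.random() < 0.5 else -1
        if position > 0:
            return step
    return None

def truncated_means(times, caps):
    """Sample mean of min(m, c) for each cutoff c; unfinished walks count as c."""
    big = max(caps)
    return [
        sum(min(t if t is not None else big, c) for t in times) / len(times)
        for c in caps
    ]

rng = random.Random(0)   # fixed seed, so repeated runs give the same numbers
CAP = 10_000             # hypothetical cutoff on the length of each walk
times = [first_passage_time(rng, CAP) for _ in range(2_000)]

hit_fraction = sum(t is not None for t in times) / len(times)
caps = [10, 100, 1_000, 10_000]
means = truncated_means(times, caps)
print(hit_fraction, means)
```

The truncated means are computed from one common set of simulated walks, so they are non-decreasing in the cutoff by construction; the point of the printout is how quickly they grow.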




In order to discuss another type of transformation which leaves the martingale and semi-martingale properties invariant we generalize the system theorem of III §5, following Halmos. Consider a fair game, that is, from our point of view, a martingale $\{x_n, \mathscr{F}_n, n \ge 1\}$, and define

\[ y_1 = x_1, \qquad y_n = x_n - x_{n-1}, \quad n > 1, \]

so that $y_n$ is the gain on a play, and the fairness of the game, that is, the martingale equality, is expressed in terms of the $y_n$'s by the condition that

\[ E\{y_{n+1} \mid \mathscr{F}_n\} = 0, \qquad n \ge 1, \]

with probability 1. The Borel field $\mathscr{F}_n$ represents as usual the influence of the past to time $n$. If a game is fair, the gambler should find that it still seems fair if he decides to skip some games, basing his decision on playing or not playing in terms of the past. This means that, if $\epsilon_n(\omega) = 1$ or 0 according as he plays or does not play the game with gain $y_n(\omega)$, the condition $\epsilon_n(\omega) = 1$ must be based on the past before time $n$, that is,

\[ \{\epsilon_n(\omega) = 1\} \in \mathscr{F}_{n-1}. \]

Let $m_n(\omega)$ be the $n$th integer $j$ with $\epsilon_j(\omega) = 1$, so that the gambler now has gains $y_{m_1}, y_{m_2}, \cdots$ instead of $y_1, y_2, \cdots$. Then we expect that the process

\[ \Big\{ \sum_{j=1}^{n} y_{m_j},\ n \ge 1 \Big\} \]

is still a martingale. To avoid confusion we restate the hypotheses, omitting references to gambling and expressing everything in terms of the $y_j$ process rather than the $x_j$ process, to facilitate comparison with the system theorem of III §5. We shall include semi-martingales along with martingales, that is, we shall allow favorable games as well as fair games.

Let $\{y_n, n \ge 1\}$ be a stochastic process, and let $\mathscr{F}_1 \subset \mathscr{F}_2 \subset \cdots$ be Borel fields of measurable $\omega$ sets with the following properties:

(i) $E\{|y_n|\} < \infty$, $n \ge 1$.

(ii) $y_n$ is either measurable with respect to $\mathscr{F}_n$ or is equal for almost all $\omega$ to a function which is. For all $n \ge 1$ either

(iii) $E\{y_{n+1} \mid \mathscr{F}_n\} = 0$

with probability 1, or the process is real and

(iiis) $E\{y_{n+1} \mid \mathscr{F}_n\} \ge 0$

with probability 1.

We shall write $\{y_n, \mathscr{F}_n, n \ge 1\}$ to stress the $\mathscr{F}_n$'s. Let $m_1, m_2, \cdots$ be random variables taking on integral values, and having the following properties:

(iv) $1 \le m_1 < m_2 < \cdots < \infty$,

(v) $\{m_j(\omega) = k\} \in \mathscr{F}_{k-1}$, $k \ge j$,

neglecting $\omega$ sets of probability 0. Define $\tilde{y}_j$ by

\[ \tilde{y}_j = y_{m_j}, \qquad j \ge 1. \]

The $\tilde{y}_j$ process will be said to be obtained from the $y_j$ process by optional skipping. If the $y_j$'s are mutually independent, with a common distribution function, and if $\mathscr{F}_n$ is the Borel field of the $\omega$ sets determined by conditions on $y_1, \cdots, y_n$, the $\tilde{y}_j$'s must also be mutually independent, with the same common distribution function, according to III, Theorem 5.2. The fact that the martingale and semi-martingale properties are also preserved is the content of the following theorem.

THEOREM 2.3 Suppose that a process $\{y_n, \mathscr{F}_n, n \ge 1\}$ with the properties (i), (ii), (iiis) of the preceding paragraph is transformed into the process $\{\tilde{y}_n, n \ge 1\}$ by optional skipping, and suppose that

\[ E\{|\tilde{y}_n|\} < \infty, \qquad n \ge 1. \tag{2.10} \]

Then the $\tilde{y}_n$ process satisfies

\[ E\{\tilde{y}_{n+1} \mid \tilde{y}_1, \cdots, \tilde{y}_n\} \ge 0, \qquad n \ge 1, \tag{2.11} \]

with probability 1. If (iiis) is replaced by (iii), there is equality in (2.11) with probability 1. The condition (2.10) is satisfied if either of the following conditions is satisfied.

$C_1$ Each $m_j$ defining the skipping is a bounded random variable, with probability 1.

$C_2$ There is a number $K$ such that, for each $j$,

\[ E\{|y_{n+1}| \mid \mathscr{F}_n\} \le K, \qquad n < m_j(\omega), \]

with probability 1.


To prove (2.11) we prove that, if $M$ is any $\omega$ set determined by conditions on $\tilde{y}_1, \cdots, \tilde{y}_n$, then

\[ \int_{M} \tilde{y}_{n+1}\,dP \ge 0, \]

and in fact we shall prove the stronger inequality that, if

\[ M_j = M\{m_{n+1}(\omega) = j\}, \qquad j \ge n+1, \]


§3 FUNDAMENTAL INEQUALITIES 311 


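then

\[ \int_{M_j} \tilde{y}_{n+1}\,dP \ge 0, \qquad j \ge n+1, \]

that is,

\[ \int_{M_j} y_j\,dP \ge 0, \qquad j \ge n+1. \]

This inequality follows from (iiis), since $M_j \in \mathscr{F}_{j-1}$ if $j \ge 2$. The condition (2.10) is certainly fulfilled if each $m_j$ is bounded with probability 1, because, if $m_j \le N$ with probability 1, it follows that

\[ E\{|\tilde{y}_j|\} \le \sum_{k=1}^{N} E\{|y_k|\} < \infty. \]

The condition (2.10) is also fulfilled if Condition $C_2$ of the theorem is fulfilled, because in that case

\[ E\{|\tilde{y}_j|\} = \sum_{k=1}^{\infty} \int_{\{m_j(\omega) = k\}} |y_k|\,dP \le K \sum_{k=1}^{\infty} P\{m_j(\omega) = k\} = K. \]

More generally, if $\alpha \ge 0$, we could have supposed only that

\[ E\{m_j^{\alpha}\} < \infty, \qquad E\{|y_{n+1}| \mid \mathscr{F}_n\} \le K n^{\alpha}, \quad n < m_j(\omega), \]

with probability 1. This condition reduces to $C_2$ if $\alpha = 0$.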
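The key step of the proof, that the integral of $y_j$ over $\{m_n(\omega) = j\}$ vanishes in the fair case, can be checked exactly on a small example. The sketch below assumes fair coin-tossing gains $y_j = \pm 1$ over a horizon of `N = 8` plays and a hypothetical skipping rule (play the first game, and afterwards play only immediately after a loss); the rule and the name `plays` are illustrations, not taken from the text.

```python
from fractions import Fraction
from itertools import product

N = 8  # number of games observed

def plays(outcomes):
    """1-based indices of the games played: game 1, and any game following a loss."""
    played = []
    for j in range(1, len(outcomes) + 1):
        # The decision about game j looks only at y_1, ..., y_{j-1}.
        if j == 1 or outcomes[j - 2] == -1:
            played.append(j)
    return played

weight = Fraction(1, 2 ** N)  # probability of each sequence of N fair tosses
# integral[(n, j)] accumulates the integral of y_j over the set {m_n = j}.
integral = {}
for outcomes in product((1, -1), repeat=N):
    for n, j in enumerate(plays(outcomes), start=1):
        integral[(n, j)] = integral.get((n, j), Fraction(0)) + weight * outcomes[j - 1]

print(all(value == 0 for value in integral.values()))
```

Every pair $(n, j)$ that occurs gives an integral of exactly zero, because the event $\{m_n = j\}$ is settled before the $j$th toss is made.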


3. Fundamental inequalities 

In the following, if we write $E\{|x|\} \le E\{|y|\}$, this inequality is to be significant, making the obvious conventions, even if one or both sides are $+\infty$.

THEOREM 3.1 Let $\{x_t, t \in T\}$ be a semi-martingale.

(i) $E\{x_t\}$ is monotone non-decreasing, and is identically constant if and only if the $x_t$ process is a martingale.

(ii) If $t_0, t_1 \in T$, with $t_0 < t_1$, then

\[ E\{|x_t|\} \le -E\{x_{t_0}\} + 2E\{|x_{t_1}|\}, \qquad t_0 \le t \le t_1. \]

(iii) If the $x_t$'s are non-negative, and if $t_1 \in T$, the $x_t$'s with $t \le t_1$ are uniformly integrable.

(iv) Suppose that $s_1 > s_2 > \cdots$, $s_n \in T$. Then the $x_{s_n}$'s are uniformly integrable if and only if

\[ \lim_{n\to\infty} E\{x_{s_n}\} > -\infty. \]

Proof of (i) If $s < t$, then

\[ x_s \le E\{x_t \mid x_r, r \le s\} \]

with probability 1. Taking expectations of the two sides of this inequality, we find that

\[ E\{x_s\} \le E\{x_t\}, \]

and there is equality if and only if there is equality with probability 1 in the preceding inequality, that is, there is equality for all pairs $s$, $t$ if and only if the $x_t$ process is a martingale.

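Proof of (ii) If $t_0 \le t \le t_1$ then, using the semi-martingale inequality and (i),

\[ \begin{aligned} E\{|x_t|\} &= -E\{x_t\} + 2\int_{\{x_t(\omega) > 0\}} x_t\,dP \\ &\le -E\{x_{t_0}\} + 2\int_{\{x_t(\omega) > 0\}} x_{t_1}\,dP \\ &\le -E\{x_{t_0}\} + 2E\{|x_{t_1}|\}. \end{aligned} \]

Proof of (iii) If $t \le t_1$,

\[ \int_{\{x_t(\omega) \ge \lambda\}} x_t\,dP \le \int_{\{x_t(\omega) \ge \lambda\}} x_{t_1}\,dP. \]

Since the integrands are non-negative, uniform integrability (for $t \le t_1$) is the uniform (for $t \le t_1$) convergence of the left side of this inequality to 0 when $\lambda \to \infty$. It is therefore sufficient to prove that

\[ \lim_{\lambda\to\infty} P\{x_t(\omega) \ge \lambda\} = 0 \]

uniformly for $t \le t_1$. This is implied by the inequality

\[ \lambda P\{x_t(\omega) \ge \lambda\} \le E\{x_t\} \le E\{x_{t_1}\}. \]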
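Proof of (iv) If the $x_{s_n}$'s are uniformly integrable, $\text{L.U.B.}_n\, E\{|x_{s_n}|\} < \infty$; this implication has nothing to do with martingale theory. Conversely, if $\lim_{n\to\infty} E\{x_{s_n}\} > -\infty$, we prove that

\[ \lim_{\lambda\to\infty} \int_{\{|x_{s_n}(\omega)| \ge \lambda\}} |x_{s_n}|\,dP = 0 \tag{3.1} \]

uniformly in $n$. Let

\[ K = \lim_{n\to\infty} E\{x_{s_n}\}. \]

Then, using (ii),

\[ \lambda P\{|x_{s_n}(\omega)| \ge \lambda\} \le E\{|x_{s_n}|\} \le -K + 2E\{|x_{s_1}|\}, \]

so the measure of the domain of integration in (3.1) goes to 0 uniformly in $n$ when $\lambda \to \infty$. Now

\[ \begin{aligned} \int_{\{|x_{s_n}(\omega)| \ge \lambda\}} |x_{s_n}|\,dP &= \int_{\{x_{s_n}(\omega) \ge \lambda\}} x_{s_n}\,dP + \int_{\{x_{s_n}(\omega) > -\lambda\}} x_{s_n}\,dP - E\{x_{s_n}\} \\ &\le \int_{\{x_{s_n}(\omega) \ge \lambda\}} x_{s_N}\,dP + \int_{\{x_{s_n}(\omega) > -\lambda\}} x_{s_N}\,dP - K, \qquad n \ge N, \\ &\le \int_{\{|x_{s_n}(\omega)| \ge \lambda\}} |x_{s_N}|\,dP + E\{x_{s_N}\} - K. \end{aligned} \tag{3.2} \]

If $\epsilon > 0$, we choose $N$ so large that

\[ E\{x_{s_N}\} - K < \frac{\epsilon}{2}, \]

and then choose $\lambda_1$ so large that $P\{|x_{s_n}(\omega)| \ge \lambda_1\}$ is (for all $n$) so small that

\[ \int_{\{|x_{s_n}(\omega)| \ge \lambda_1\}} |x_{s_N}|\,dP < \frac{\epsilon}{2}, \]

and finally choose $\lambda_2$ so large that $\lambda_2 \ge \lambda_1$ and that

\[ \int_{\{|x_{s_n}(\omega)| \ge \lambda_2\}} |x_{s_n}|\,dP < \epsilon, \qquad n < N. \tag{3.3} \]

With this choice of $\lambda_2$, (3.2) and (3.3) imply

\[ \int_{\{|x_{s_n}(\omega)| \ge \lambda_2\}} |x_{s_n}|\,dP < \epsilon, \qquad n \ge 1, \]

so that (3.1) is true uniformly in $n$, as was to be proved.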

Examples will be given which will exhibit the fact that (iv) is no longer true if the sequence $\{s_n\}$ is monotone increasing instead of monotone decreasing. In fact, a semi-martingale $\{x_n, 1 \le n \le \infty\}$ will be exhibited below with

\[ x_n \le 0, \qquad E\{x_n\} = -1, \quad n < \infty, \qquad \lim_{n\to\infty} x_n = 0 = x_\infty \]

(the limit holding with probability 1). The $x_n$'s cannot be uniformly integrable, since $E\{x_n\}$ does not converge to $E\{x_\infty\}$.
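One standard construction with these properties can be worked out by hand, and the short sketch below does the arithmetic exactly. It is offered as an assumption, not as the example the text goes on to give: take $x_n = -2^n$ on an $\omega$ set of probability $2^{-n}$ (the sets decreasing in $n$) and $x_n = 0$ elsewhere, with $x_\infty = 0$. The function names are made up.

```python
from fractions import Fraction

def mean(n):
    """E{x_n} when x_n = -2**n with probability 2**-n and x_n = 0 otherwise."""
    return Fraction(-2 ** n) * Fraction(1, 2 ** n)

def tail_integral(n, lam):
    """Integral of |x_n| over the set where |x_n| >= lam, for lam > 0."""
    # |x_n| takes only the values 0 and 2**n, so the tail is all or nothing.
    return Fraction(1) if 2 ** n >= lam else Fraction(0)

lam = 1000
print([mean(n) for n in range(1, 6)])
print(max(tail_integral(n, lam) for n in range(1, 40)))
```

However large $\lambda$ is taken, the tail integral equals 1 for all sufficiently large $n$, so the tail integrals do not go to 0 uniformly in $n$, which is the failure of uniform integrability.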

Many fairly obvious applications of Theorem 3.1 can be obtained by using Theorem 1.1. For example, if $\{x_t, t \in T\}$ is a real or complex martingale, and if, for some $t_0$ and $\alpha \ge 1$, $E\{|x_{t_0}|^{\alpha}\} < \infty$, then the $|x_t|^{\alpha}$ process for $t \le t_0$ is a semi-martingale, $E\{|x_t|^{\alpha}\}$ is monotone non-decreasing, and the $|x_t|^{\alpha}$'s are uniformly integrable for $t \le t_0$. This result is always significant if $\alpha = 1$, but, if $\alpha > 1$, $E\{|x_t|^{\alpha}\}$ may never be finite.




According to Theorem 3.1 (i), if the $x_t$ process is a semi-martingale when the $x_t$'s are ordered in the direction of decreasing $t$ as well as of increasing $t$, it is a real martingale in both orders. We now show that any process which is a martingale in both orders has the property that for every pair of parameter values $s$, $t$, $x_s = x_t$ with probability 1. To prove this it is sufficient to prove that if

\[ \int_{\Lambda} (y - x)\,dP = 0 \]

for every set $\Lambda = \{y(\omega) \in B\}$ and every set $\Lambda = \{x(\omega) \in B\}$, where $B$ is a Borel set, then $x = y$ with probability 1. We may assume that $x$ and $y$ are real. Then, for every real $c$,

\[ \begin{aligned} \int_{\{y(\omega) > c,\ x(\omega) \le c\}} (y - x)\,dP &= \int_{\{y(\omega) > c\}} (y - x)\,dP - \int_{\{y(\omega) > c,\ x(\omega) > c\}} (y - x)\,dP \\ &= -\int_{\{y(\omega) > c,\ x(\omega) > c\}} (y - x)\,dP \\ &= -\int_{\{x(\omega) > c\}} (y - x)\,dP + \int_{\{x(\omega) > c,\ y(\omega) \le c\}} (y - x)\,dP \\ &= \int_{\{x(\omega) > c,\ y(\omega) \le c\}} (y - x)\,dP. \end{aligned} \]

Since the first integrand is $\ge 0$, and the last is $\le 0$, the first and last integrals must vanish, so that

\[ P\{y(\omega) > c \ge x(\omega)\} = P\{y(\omega) \le c < x(\omega)\} = 0. \]

Then, restricting $c$ to be rational,

\[ P\{y(\omega) \ne x(\omega)\} \le \sum_{c} P\{y(\omega) > c \ge x(\omega)\} + \sum_{c} P\{y(\omega) \le c < x(\omega)\} = 0, \]

as was to be proved.


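THEOREM 3.2 Let $\{x_j, 1 \le j \le n\}$ be a semi-martingale, and let $\lambda$ be any real number. Then

\[ \lambda P\{\operatorname{Max}_j x_j(\omega) \ge \lambda\} \le \int_{\{\operatorname{Max}_j x_j(\omega) \ge \lambda\}} x_n\,dP \le E\{|x_n|\}, \tag{3.4'} \]

\[ \lambda P\{\operatorname{Min}_j x_j(\omega) \le \lambda\} \ge \int_{\{\operatorname{Min}_j x_j(\omega) \le \lambda\}} x_n\,dP - E\{x_n - x_1\} \ge E\{x_1\} - E\{|x_n|\}. \tag{3.4''} \]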




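We prove the two parts of this theorem in different ways, to illustrate the possibilities. Inequality (3.4') is proved by direct calculation as follows. Let $\Lambda = \{\operatorname{Max}_j x_j(\omega) \ge \lambda\}$, and define the $\omega$ set $\Lambda_k$ as the $\omega$ set for which $x_k(\omega)$ is the first $x_j(\omega)$ with $x_j(\omega) \ge \lambda$:

\[ \Lambda_1 = \{x_1(\omega) \ge \lambda\}, \qquad \Lambda_k = \{x_j(\omega) < \lambda,\ j < k;\ x_k(\omega) \ge \lambda\}, \quad k > 1. \]

Then $\Lambda_k$ is determined by conditions on the $x_j$'s for $j \le k$, the $\Lambda_k$'s are disjunct, and $\bigcup_k \Lambda_k = \Lambda$. Using the semi-martingale inequality and the fact that $x_k(\omega) \ge \lambda$ on $\Lambda_k$ we find that

\[ \int_{\Lambda} x_n\,dP = \sum_{k} \int_{\Lambda_k} x_n\,dP \ge \sum_{k} \int_{\Lambda_k} x_k\,dP \ge \lambda \sum_{k} P\{\Lambda_k\} = \lambda P\{\Lambda\}, \]

which proves (3.4').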
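The three terms of (3.4') can be computed exactly for a small semi-martingale. The following sketch assumes the coin-tossing martingale $x_j = y_1 + \cdots + y_j$ with $y_i = \pm 1$ equally likely, and enumerates all $2^n$ paths; the function name `check_34_prime` is invented for this illustration.

```python
from fractions import Fraction
from itertools import product

def check_34_prime(n, lam):
    """Return the three terms of (3.4') for the walk x_j = y_1 + ... + y_j, y_i = +1/-1."""
    weight = Fraction(1, 2 ** n)          # probability of each path
    prob = integral = abs_mean = Fraction(0)
    for steps in product((1, -1), repeat=n):
        path, total = [], 0
        for s in steps:
            total += s
            path.append(total)
        abs_mean += weight * abs(path[-1])
        if max(path) >= lam:              # the path belongs to {Max_j x_j >= lam}
            prob += weight
            integral += weight * path[-1]
    return lam * prob, integral, abs_mean

for lam in (1, 2, 3):
    left, middle, right = check_34_prime(8, lam)
    print(lam, left, middle, right, left <= middle <= right)
```

Because the arithmetic is rational, the printed comparison is an exact check of the two inequalities for this walk, not a numerical approximation.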


We prove (3.4'') as an application of a game theorem. Let $M = \{\operatorname{Min}_j x_j(\omega) \le \lambda\}$, let $m(\omega)$ be the first value of $j$ for which $x_j(\omega) \le \lambda$, if $\omega \in M$, and let $m(\omega) = n$ otherwise. Then the condition $m(\omega) = k$ is a condition on $x_1, \cdots, x_k$. We now apply optional stopping, determined by the random variable $m$ just defined, as described in §2. Then, in the notation of Theorem 2.1, $\tilde{x}_n = x_m$, and according to that theorem

\[ E\{x_1\} \le E\{\tilde{x}_n\}, \]

so that

\[ E\{x_1\} \le E\{\tilde{x}_n\} = \int_{M} \tilde{x}_n\,dP + \int_{\Omega - M} x_n\,dP \le \lambda P\{M\} + E\{x_n\} - \int_{M} x_n\,dP, \]

which implies (3.4'').

This theorem contains III, Theorem 2.1 (essentially Kolmogorov's generalization of the Chebyshev inequality), because, if $\{x_j, 1 \le j \le n\}$ is a martingale, and if $E\{|x_n|^2\} < \infty$, then $\{|x_j|^2, 1 \le j \le n\}$ is a semi-martingale, so that, if $\epsilon > 0$ we have, from (3.4'),

\[ P\{\operatorname{Max}_j |x_j(\omega)|^2 \ge \epsilon^2\} \le \frac{1}{\epsilon^2} E\{|x_n|^2\}. \]
This inequality is precisely III (2.1).
Let $\xi_1, \cdots, \xi_n$ be any real numbers, and let $r_1$, $r_2$ be real numbers with $r_1 < r_2$. The number of upcrossings of the interval $[r_1, r_2]$ by $\xi_1, \cdots, \xi_n$ is defined as the number of times the sequence $\xi_1, \cdots, \xi_n$ passes from below $r_1$ to above $r_2$. More precisely, let $\xi_{\nu_1}$ be the first $\xi_j$ (if any) for which $\xi_j \le r_1$, and in general let $\xi_{\nu_j}$ be the first $\xi_i$ (if any) after $\xi_{\nu_{j-1}}$ for which

\[ \xi_i \ge r_2 \quad (j \text{ even}), \qquad \xi_i \le r_1 \quad (j \text{ odd}), \]

so that

\[ \xi_{\nu_1} \le r_1, \quad \xi_{\nu_2} \ge r_2, \quad \xi_{\nu_3} \le r_1, \quad \cdots. \]

Then the number of upcrossings is $\beta$, where $2\beta$ is the largest even integer $j$ for which $\nu_j$ is defined, and $\beta = 0$ if $\nu_2$ is not defined.
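The definition translates directly into a single pass over the sequence. The sketch below is one way to write it; the function name and the use of a boolean flag in place of the indices $\nu_j$ are choices made for this illustration.

```python
def count_upcrossings(xs, r1, r2):
    """Number of upcrossings of [r1, r2] by the finite sequence xs (requires r1 < r2)."""
    if not r1 < r2:
        raise ValueError("need r1 < r2")
    count = 0
    below = False  # True after a value <= r1 that has not yet been followed by one >= r2
    for x in xs:
        if not below:
            if x <= r1:
                below = True      # an index nu_j with j odd has been found
        elif x >= r2:
            count += 1            # an index nu_j with j even: one upcrossing completed
            below = False
    return count

print(count_upcrossings([0, 2, 0, 2], 0.5, 1.5))
print(count_upcrossings([2, 0, 1, 0, 3, 0], 0, 2))
```

In the second call the final value 0 starts a new passage that is never completed, so it does not add to the count.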
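THEOREM 3.3 Let $\{x_j, 1 \le j \le n\}$ be a semi-martingale, and let $\beta(\omega)$ be the number of upcrossings of $[r_1, r_2]$ by a sample sequence $[x_1(\omega), \cdots, x_n(\omega)]$. Then

\[ E\{\beta\} \le \frac{1}{r_2 - r_1} \int_{\{x_n(\omega) \ge r_1\}} (x_n - r_1)\,dP \le \frac{E\{|x_n|\} + |r_1|}{r_2 - r_1}. \tag{3.5} \]

To prove the theorem assume first that the $x_j$'s are non-negative, and that $r_1 = 0$. Define the $\omega$ functions $\nu_1, \cdots, \nu_n$ in terms of $\xi_1 = x_1(\omega), \cdots, \xi_n = x_n(\omega)$ as described above, defining $\nu_j(\omega) = n + 1$ if the above definitions leave $\nu_j(\omega)$ undefined. The $\nu_j$'s and $\beta$ are now random variables. Define the random variables $u_2, \cdots, u_n, z$ by

\[ u_j(\omega) = \begin{cases} 1, & j \le \nu_1(\omega), \\ 1, & \nu_i(\omega) < j \le \nu_{i+1}(\omega) \quad (i \text{ even}), \\ 0, & \nu_i(\omega) < j \le \nu_{i+1}(\omega) \quad (i \text{ odd}), \end{cases} \qquad z = x_1 + \sum_{j=2}^{n} u_j (x_j - x_{j-1}). \]

Then

\[ \{u_j(\omega) = 1\} = \bigcup_{i \text{ even}} \big[ \{\nu_i(\omega) < j\} - \{\nu_{i+1}(\omega) < j\} \big], \]

so that the $\omega$ set on the left is determined by conditions on $x_1, \cdots, x_{j-1}$. Hence, using the semi-martingale property,

\[ \int_{\{u_j(\omega) = 1\}} (x_j - x_{j-1})\,dP \ge 0. \]

It follows that

\[ E\{z\} = E\{x_1\} + \sum_{j=2}^{n} \int_{\{u_j(\omega) = 1\}} (x_j - x_{j-1})\,dP \ge 0. \tag{3.6} \]

On the other hand,

\[ z(\omega) \le x_n(\omega) - r_2 \beta(\omega), \]

so that

\[ 0 \le E\{z\} \le E\{x_n\} - r_2 E\{\beta\}, \]

as was to be proved. In the general case, without the restrictions that the $x_j$'s be non-negative, and that $r_1 = 0$, define

\[ x_j'(\omega) = \operatorname{Max}\,[x_j(\omega) - r_1,\ 0]. \]

Then the $x_j'$ process is a semi-martingale, by Theorem 1.1. The number of upcrossings of $[r_1, r_2]$ by the $x_j(\omega)$'s is the number of upcrossings of $[0, r_2 - r_1]$ by the $x_j'(\omega)$'s. Hence we can apply the special case of the theorem already derived to obtain

\[ E\{\beta\} \le \frac{E\{x_n'\}}{r_2 - r_1} = \frac{1}{r_2 - r_1} \int_{\{x_n(\omega) \ge r_1\}} (x_n - r_1)\,dP, \]

which was to be proved.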
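The three quantities in (3.5) can be compared exactly on a small example. The sketch below assumes the coin-tossing martingale $x_j = y_1 + \cdots + y_j$ with $y_i = \pm 1$ equally likely and enumerates all paths; the helper names are invented, and the upcrossing counter follows the definition given before the theorem.

```python
from fractions import Fraction
from itertools import product

def count_upcrossings(xs, r1, r2):
    """Number of upcrossings of [r1, r2] by the finite sequence xs (r1 < r2)."""
    count, below = 0, False
    for x in xs:
        if not below:
            below = x <= r1
        elif x >= r2:
            count, below = count + 1, False
    return count

def check_upcrossing_bound(n, r1, r2):
    """Return E{beta} and the two bounds of (3.5) for the +1/-1 walk of length n."""
    weight = Fraction(1, 2 ** n)
    mean_beta = integral = abs_mean = Fraction(0)
    for steps in product((1, -1), repeat=n):
        path, total = [], 0
        for s in steps:
            total += s
            path.append(total)
        mean_beta += weight * count_upcrossings(path, r1, r2)
        if path[-1] >= r1:
            integral += weight * (path[-1] - r1)
        abs_mean += weight * abs(path[-1])
    width = r2 - r1
    return mean_beta, integral / width, (abs_mean + abs(r1)) / width

a, b, c = check_upcrossing_bound(8, 0, 2)
print(a, b, c, a <= b <= c)
```

The same function can be called with other intervals $[r_1, r_2]$, including intervals lying entirely below 0, to see how loose the second bound becomes.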

The following theorem sharpens Theorem 3.2. The important point 
about both is that the number of random variables in the semi-martingale 
is not involved, so that the theorem is applicable to martingales with 
infinitely many random variables, as long as there is a last one. 

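THEOREM 3.4 If $\{x_j, 1 \le j \le n\}$ is a semi-martingale, and if the $x_j$'s are non-negative, then

\[ E\{\operatorname{Max}_j x_j^{\alpha}\} \le \begin{cases} \dfrac{e}{e-1} + \dfrac{e}{e-1}\, E\{x_n \log^{+} x_n\}, & \alpha = 1, \\[1ex] \left(\dfrac{\alpha}{\alpha - 1}\right)^{\alpha} E\{x_n^{\alpha}\}, & \alpha > 1. \end{cases} \tag{3.7} \]

Note that the coefficient in the second inequality is a decreasing function of $\alpha$, decreasing to $e$ when $\alpha \to \infty$. In view of Theorem 3.2, (3.7) follows at once from the following theorem, which is stated separately for later reference, and because it has independent interest.

THEOREM 3.4' If $x$ and $y$ are non-negative random variables satisfying the inequality

\[ P\{y(\omega) \ge \lambda\} \le \frac{1}{\lambda} \int_{\{y(\omega) \ge \lambda\}} x\,dP \tag{3.8} \]

for all $\lambda > 0$, then

\[ E\{y^{\alpha}\} \le \begin{cases} \dfrac{e}{e-1} + \dfrac{e}{e-1}\, E\{x \log^{+} x\}, & \alpha = 1, \\[1ex] \left(\dfrac{\alpha}{\alpha - 1}\right)^{\alpha} E\{x^{\alpha}\}, & \alpha > 1. \end{cases} \tag{3.7'} \]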



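To prove this theorem, let $\Psi$ be a monotone non-decreasing function of $\lambda$, $\lambda \ge 0$, with $\Psi(0) = 0$. Then

\[ \begin{aligned} E\{\Psi(y)\} &= -\int_{0}^{\infty} \Psi(\lambda)\,dP\{y(\omega) \ge \lambda\} \\ &\le \int_{0}^{\infty} P\{y(\omega) \ge \lambda\}\,d\Psi(\lambda) \\ &\le \int_{0}^{\infty} \frac{1}{\lambda} \int_{\{y(\omega) \ge \lambda\}} x\,dP\,d\Psi(\lambda) \\ &= \int_{\Omega} x \int_{0}^{y} \frac{d\Psi(\lambda)}{\lambda}\,dP. \end{aligned} \tag{3.9} \]

To prove the first inequality of (3.7'), define $\Psi$ by

\[ \Psi(\lambda) = \begin{cases} \lambda - 1, & \lambda \ge 1, \\ 0, & \lambda < 1. \end{cases} \]

Then, using (3.9),

\[ E\{y\} - 1 \le E\{\Psi(y)\} \le \int_{\{y(\omega) \ge 1\}} x \log y\,dP. \]

The first inequality of (3.7') now follows at once, in view of the inequality

\[ a \log b \le a \log^{+} a + \frac{b}{e} \qquad (a \ge 0,\ b > 0). \]

To prove the second inequality of (3.7'), define $\Psi$ by

\[ \Psi(\lambda) = \lambda^{\alpha}. \]

Then, using (3.9),

\[ E\{y^{\alpha}\} \le \frac{\alpha}{\alpha - 1} E\{x y^{\alpha - 1}\}, \]

so that, applying Hölder's inequality,

\[ E\{y^{\alpha}\} \le \frac{\alpha}{\alpha - 1} E\{x^{\alpha}\}^{1/\alpha}\, E\{y^{\alpha}\}^{(\alpha - 1)/\alpha}, \]

which yields the desired inequality in (3.7').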
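The second inequality of (3.7) can be checked exactly on a small non-negative semi-martingale. The sketch below assumes $x_j = |y_1 + \cdots + y_j|$ with $y_i = \pm 1$ equally likely (the absolute value of a martingale, hence a semi-martingale) and integer exponents $\alpha$; the function name is made up for the illustration.

```python
from fractions import Fraction
from itertools import product

def maximal_moments(n, alpha):
    """E{Max_j x_j**alpha} and E{x_n**alpha} for x_j = |y_1 + ... + y_j|, y_i = +1/-1."""
    weight = Fraction(1, 2 ** n)
    lhs = rhs = Fraction(0)
    for steps in product((1, -1), repeat=n):
        total, running_max = 0, 0
        for s in steps:
            total += s
            running_max = max(running_max, abs(total))
        lhs += weight * running_max ** alpha
        rhs += weight * abs(total) ** alpha
    return lhs, rhs

for alpha in (2, 3):
    lhs, rhs = maximal_moments(10, alpha)
    coeff = Fraction(alpha, alpha - 1) ** alpha   # the coefficient in (3.7)
    print(alpha, lhs, coeff * rhs, lhs <= coeff * rhs)
```

For $\alpha = 2$ the coefficient is 4, and the right-hand expectation is $E\{x_n^2\} = n$ for this walk.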


§4 CONVERGENCE THEOREMS 319


4. Convergence theorems 

We shall prove a succession of convergence theorems for martingales 
and semi-martingales in this section. The theorems are not much weaker 
for semi-martingales than for martingales, but to clarify the discussion 
the martingale theorems and semi-martingale theorems will be stated 
separately, at the expense of some duplication. 

If $x_1, x_2, \cdots$ are the random variables of a martingale, we have seen that, if $E\{|x_n|^2\} < \infty$, we can write the $x_n$'s in the form

\[ x_n = \sum_{j=1}^{n} y_j \]

where the $y_j$'s are mutually orthogonal. Then

\[ E\{|x_n|^2\} = \sum_{j=1}^{n} E\{|y_j|^2\} \]

and this exhibits the fact that $E\{|x_n|^2\}$ is monotone non-decreasing in $n$, as it should be according to Theorem 3.1 (i). According to IV, Theorem 4.1, $\text{l.i.m.}_{n\to\infty}\, x_n$ exists if and only if $\sum_{1}^{\infty} E\{|y_j|^2\} < \infty$, that is, if and only if $\lim_{n\to\infty} E\{|x_n|^2\} < \infty$. The following theorem sharpens this rather superficial result.
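THEOREM 4.1 Let $\{x_n, \mathscr{F}_n, n \ge 1\}$ be a martingale. Then, by Theorem 3.1 (i),

\[ E\{|x_1|\} \le E\{|x_2|\} \le \cdots. \]

(i) If $\lim_{n\to\infty} E\{|x_n|\} = K < \infty$, then $\lim_{n\to\infty} x_n = x_\infty$ exists with probability 1 and $E\{|x_\infty|\} \le K$. In particular, $K < \infty$ if the $x_n$'s are all real and $\ge 0$ or all real and $\le 0$.

(ii) The following conditions are equivalent:

(a) $K < \infty$, and the random variables $x_1, x_2, \cdots, x_\infty$ constitute a martingale.

(b) The random variables $x_1, x_2, \cdots$ are uniformly integrable.

(c) $K < \infty$ and $E\{|x_\infty|\} = K$.

(d) $K < \infty$ and $\lim_{n\to\infty} E\{|x_\infty - x_n|\} = 0$.

If these conditions are satisfied, and if $\mathscr{F}_\infty$ is the smallest Borel field of $\omega$ sets with $\mathscr{F}_\infty \supset \bigcup_n \mathscr{F}_n$, then $\{x_n, \mathscr{F}_n, 1 \le n \le \infty\}$ is a martingale.

(iii) If, for some $\alpha > 1$, $\lim_{n\to\infty} E\{|x_n|^{\alpha}\} < \infty$, then the conditions of (ii) are satisfied, $E\{|x_\infty|^{\alpha}\} < \infty$, and

\[ \lim_{n\to\infty} E\{|x_\infty - x_n|^{\alpha}\} = 0. \]

Conversely, if the conditions of (ii) are satisfied and if $E\{|x_\infty|^{\alpha}\} < \infty$ for some $\alpha > 1$, then

\[ E\{|x_n|^{\alpha}\} \le E\{|x_\infty|^{\alpha}\}, \qquad n \ge 1. \]

(iv) If the $x_n$'s are real, and if

\[ E\{\text{L.U.B.}_{n \ge 0}\, (x_{n+1} - x_n)\} < \infty \qquad (x_0 = 0), \tag{4.1} \]

then $\lim_{n\to\infty} x_n(\omega)$ exists and is finite for almost all $\omega$ for which

\[ \limsup_{n\to\infty} x_n(\omega) < \infty. \]

(v) If

\[ E\{\text{L.U.B.}_{n \ge 0}\, |x_{n+1} - x_n|^2\} < \infty \qquad (x_0 = 0), \tag{4.2} \]

then $\lim_{n\to\infty} x_n(\omega)$ exists and is finite for almost all $\omega$ for which

\[ \sum_{1}^{\infty} E\{|x_{n+1} - x_n|^2 \mid \mathscr{F}_n\} < \infty, \]

and conversely.

This is the strict sense version of IV, Theorem 7.2.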

Proof of (i) It is sufficient to prove this part in the real case and we accordingly assume that the $x_n$'s are real. Suppose that $K < \infty$ and let $x_*$, $x^*$ be respectively the inferior and superior limits of the sequence $\{x_n\}$. Then

\[ \{x^*(\omega) - x_*(\omega) \ne 0\} = \bigcup_{r_1, r_2} \{x^*(\omega) > r_2 > r_1 > x_*(\omega)\} \qquad (r_j \text{ rational}). \tag{4.3} \]

Fix $r_1$ and $r_2 > r_1$ and let $\beta_n(\omega)$ be the number of upcrossings of the interval $[r_1, r_2]$ by $x_1(\omega), \cdots, x_n(\omega)$. Then $\beta_n \to \infty$ monotonely whenever $x^*(\omega) > r_2 > r_1 > x_*(\omega)$, although according to Theorem 3.3

\[ E\{\beta_n\} \le \frac{E\{|x_n|\} + |r_1|}{r_2 - r_1} \le \frac{K + |r_1|}{r_2 - r_1}. \]

These two facts imply that each summand in (4.3) has probability 0, so that $x^* = x_*$ with probability 1, that is, $\lim_{n\to\infty} x_n = x_\infty$ exists as a finite or infinite limit with probability 1. By Fatou's lemma, $|x_\infty| < \infty$ with probability 1, and

\[ E\{|x_\infty|\} \le \lim_{n\to\infty} E\{|x_n|\} = K. \]

Moreover, there is equality if and only if $x_1, x_2, \cdots$ are uniformly integrable. In particular, if the $x_n$'s are all $\ge 0$ or all $\le 0$,

\[ E\{|x_n|\} = E\{x_n\} = \text{const.} \qquad \text{or} \qquad E\{|x_n|\} = -E\{x_n\} = \text{const.}, \]

as the case may be, using Theorem 3.1 (i), and hence $K < \infty$. More generally, if we write $x_n = x_n^{+} - x_n^{-}$, where $x_n^{+}$ and $x_n^{-}$ are non-negative, then $K$ is finite if $E\{x_n^{+}\}$ or $E\{x_n^{-}\}$ is bounded in $n$. For example, the first assertion follows from the equality

\[ E\{|x_n|\} = -E\{x_n\} + 2E\{x_n^{+}\} = -E\{x_1\} + 2E\{x_n^{+}\}. \]

Thus $K$ is finite if the $x_n$'s are uniformly bounded from above or below by a random variable whose expectation exists.
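A concrete non-negative martingale shows that the inequality $E\{|x_\infty|\} \le K$ can be strict. The example below is an assumption introduced for illustration, not one taken from the text: $x_n$ is the product of $n$ independent factors, each equal to 1/2 or 3/2 with probability 1/2. Then $E\{x_n\} = 1$ for every $n$, so $K = 1$, while the strong law of large numbers applied to the logarithms of the factors gives $x_n \to 0$ with probability 1. The sketch computes $E\{x_n\}$ and $P\{x_n \ge 1/100\}$ exactly; the function names are made up.

```python
from fractions import Fraction
from math import comb

def mean(n):
    """E{x_n} for x_n = product of n independent factors, each 1/2 or 3/2 with probability 1/2."""
    # If k of the factors equal 3/2, then x_n = 3**k / 2**n.
    return sum(Fraction(comb(n, k), 2 ** n) * Fraction(3 ** k, 2 ** n) for k in range(n + 1))

def prob_at_least(n, numer, denom):
    """P{x_n >= numer/denom}, computed with integer arithmetic only."""
    favorable = sum(comb(n, k) for k in range(n + 1) if 3 ** k * denom >= numer * 2 ** n)
    return Fraction(favorable, 2 ** n)

for n in (50, 200, 400):
    print(n, mean(n), float(prob_at_least(n, 1, 100)))
```

The expectation stays fixed while the probability of exceeding any fixed positive level shrinks, which is the behavior of a martingale that converges with probability 1 without being uniformly integrable.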

Proof of (ii) If (b) is true, $K < \infty$. Thus statements (a)-(d) of (ii) all either imply or suppose that $K < \infty$, so that $x_\infty$ is defined in each case. Conditions (c) and (d) are simply necessary and sufficient conditions for uniform integrability of an almost everywhere convergent sequence of functions, and have nothing specific to do with martingales. Condition (a) implies (b), uniform integrability, by Theorem 3.1 (iii) applied to the $|x_n|$ process. Conversely, if there is uniform integrability, we prove that condition (a) is satisfied, that is, that $\{x_n, \mathscr{F}_n, 1 \le n \le \infty\}$ is a martingale, by proving that

\[ E\{x_\infty \mid \mathscr{F}_n\} = x_n \]

with probability 1, for every (finite) $n$. That is, we prove that

\[ \int_{\Lambda} x_\infty\,dP = \int_{\Lambda} x_n\,dP, \qquad \Lambda \in \mathscr{F}_n. \tag{4.4} \]

Since the original sequence is a martingale, if $m > n$,

\[ \int_{\Lambda} x_m\,dP = \int_{\Lambda} E\{x_m \mid \mathscr{F}_n\}\,dP = \int_{\Lambda} x_n\,dP. \]

Now, when $m \to \infty$, $x_m \to x_\infty$ with probability 1, and since there is uniform integrability we can integrate to the limit as $m \to \infty$ on the left to obtain (4.4).

Proof of (iii) If, for some $\alpha > 1$, $\lim_{n\to\infty} E\{|x_n|^{\alpha}\} < \infty$, the boundedness of this convergent sequence of expectations implies uniform integrability of the $x_n$'s so that the conditions of (ii) are satisfied. By Fatou's lemma,

\[ E\{|x_\infty|^{\alpha}\} \le \lim_{n\to\infty} E\{|x_n|^{\alpha}\} < \infty. \]

Then by Theorem 3.1 (iii) or the deeper Theorem 3.4 the sequence $\{|x_n|^{\alpha}\}$ (and therefore also the sequence $\{|x_\infty - x_n|^{\alpha}\}$) is uniformly integrable, so that $x_n \to x_\infty$ with probability 1 implies that $\lim_{n\to\infty} E\{|x_\infty - x_n|^{\alpha}\} = 0$. Conversely, if the conditions of (ii) are satisfied, $E\{|x_n|^{\alpha}\} \le E\{|x_\infty|^{\alpha}\}$ by Theorem 3.1 (i).




Proof of (iv) Suppose that the condition (4.1) is satisfied. Let $N$ be any positive number, and let $m(\omega)$ be the first integer $j$ for which $x_j(\omega) > N$, or let $m(\omega) = +\infty$ if there is no such integer. Define $x_n^{(N)}$ by

\[ x_n^{(N)}(\omega) = \begin{cases} x_n(\omega), & \text{if } n \le m(\omega), \\ x_{m(\omega)}(\omega), & \text{if } n > m(\omega), \end{cases} \tag{4.5} \]

and let $w = \text{L.U.B.}_{n \ge 0}\, (x_{n+1} - x_n)$. The condition $m(\omega) = k$ is a condition on $x_1, \cdots, x_k$ and the $x_n^{(N)}$ process has been obtained from the $x_n$ process by optional stopping based on the stopping variable $m$. Then according to Theorem 2.1 the process $\{x_n^{(N)}, n \ge 1\}$ is a martingale. Moreover, $x_n^{(N)} \le N + w$. According to the remark made at the end of the proof of (i), the fact that $x_n^{(N)}$ is bounded from above by a random variable whose expectation exists implies that $E\{|x_n^{(N)}|\}$ is bounded in $n$. It now follows from (i) that $\lim_{n\to\infty} x_n^{(N)}$ exists and is finite with probability 1. Since

\[ x_n^{(N)}(\omega) = x_n(\omega) \qquad \text{if } \text{L.U.B.}_j\, x_j(\omega) \le N, \]

it follows that $\lim_{n\to\infty} x_n(\omega)$ exists and is finite for almost all $\omega$ with $\text{L.U.B.}_n\, x_n(\omega) \le N$, that is, with $\limsup_{n\to\infty} x_n(\omega) < \infty$, since $N$ is arbitrary.

Proof of (v) Let $N$ be any positive number, and let $m(\omega)$ be the first integer $j$ for which $|x_j(\omega)| > N$, or let $m(\omega) = \infty$ if there is no such integer. Define $x_n^{(N)}$ by (4.5) and $w^2$ as the L.U.B. in (4.2). Then, as in (iv), the $x_n^{(N)}$ process has been obtained from the $x_n$ process by optional stopping. According to Theorem 2.1 the process $\{x_n^{(N)}, n \ge 1\}$ is a martingale, and

\[ |x_n^{(N)}| \le N + w, \qquad E\{|x_n^{(N)}|^2\} \le 2N^2 + 2E\{w^2\} < \infty. \]

Then the conditions of (iii) are fulfilled, and we conclude that the sequence $\{x_n^{(N)}, n \ge 1\}$ converges with probability 1 and in the mean. Since the series $\sum_{1}^{\infty} (x_{n+1}^{(N)} - x_n^{(N)})$ is a series of mutually orthogonal functions which converges in the mean, $\sum_{1}^{\infty} E\{|x_{n+1}^{(N)} - x_n^{(N)}|^2\} < \infty$, by IV, Theorem 4.1. Hence (with probability 1)

\[ \sum_{1}^{\infty} E\{|x_{n+1}^{(N)} - x_n^{(N)}|^2 \mid \mathscr{F}_n\} < \infty, \]

because the operation $E\{-\}$ performed on the terms of this series yields a convergent series. Moreover,

\[ E\{|x_{n+1}^{(N)} - x_n^{(N)}|^2 \mid \mathscr{F}_n\} = \begin{cases} E\{|x_{n+1} - x_n|^2 \mid \mathscr{F}_n\}, & n < m(\omega), \\ 0, & n \ge m(\omega), \end{cases} \]

so that

\[ \sum_{1}^{\infty} E\{|x_{n+1} - x_n|^2 \mid \mathscr{F}_n\} < \infty \tag{4.6} \]

almost everywhere where $m(\omega) = \infty$, and therefore almost everywhere where $\lim_{n\to\infty} x_n$ exists and is finite, since $N$ is arbitrary. This proves half the statement of (v). To go in the other direction define $m(\omega)$ as the first integer $j$ for which

\[ \sum_{n=1}^{j} E\{|x_{n+1} - x_n|^2 \mid \mathscr{F}_n\} > N \]

and then define $x_n^{(N)}$ by (4.5). Then, as before, $\{x_n^{(N)}, n \ge 1\}$ is a martingale, and

\[ \sum_{1}^{\infty} E\{|x_{n+1}^{(N)} - x_n^{(N)}|^2 \mid \mathscr{F}_n\} = \sum_{1}^{m-1} E\{|x_{n+1} - x_n|^2 \mid \mathscr{F}_n\} \le N. \]

Hence, taking expectations,

\[ \sum_{1}^{\infty} E\{|x_{n+1}^{(N)} - x_n^{(N)}|^2\} = \lim_{n\to\infty} E\{|x_n^{(N)} - x_1|^2\} \le N, \]

so that $E\{|x_n^{(N)}|^2\}$ is bounded when $n \to \infty$. It follows from (iii) that $\lim_{n\to\infty} x_n^{(N)}$ exists and is finite, with probability 1, and therefore that $\lim_{n\to\infty} x_n(\omega)$ exists almost everywhere where $m(\omega) = \infty$, that is, almost everywhere where (4.6) is true, since $N$ is arbitrary.

If the sequence $x_1, x_2, \cdots$ is a martingale, and if $\lim_{n\to\infty} E\{|x_n|\} < \infty$, so that $\lim_{n\to\infty} x_n = x_\infty$ exists, it is not necessarily true that the sequence $x_1, x_2, \cdots, x_\infty$ is a martingale. In other words, Theorem 4.1 (i) describes a situation more general than that of Theorem 4.1 (ii). Although simple examples of this are easily given, we omit them here because such examples appear in a natural way in §8. We shall see in §5 that the situations in Theorem 4.1 (i) and (ii) are identical if the differences $\{x_{n+1} - x_n\}$ are mutually independent, that is, if the $x_n$'s are the partial sums of a series of mutually independent random variables.

The following corollaries exhibit the power of Theorem 4.1. There 
will be many other applications of the theorem in later sections. 

COROLLARY 1 Let $y_1, y_2, \cdots$ be a sequence of uniformly bounded non-negative random variables, and let $p_j = E\{y_j \mid y_1, \cdots, y_{j-1}\}$. Then the series $\sum_{1}^{\infty} y_j(\omega)$ converges for almost all $\omega$ for which $\sum_{1}^{\infty} p_j(\omega)$ converges, and conversely.

If $x_n = \sum_{1}^{n} (y_j - p_j)$, the process $\{x_n, n \ge 1\}$ is a martingale. By Theorem 4.1 (iv), applied to this process and to $\{-x_n, n \ge 1\}$, $\lim_{n\to\infty} x_n(\omega)$ exists almost everywhere, and is finite where

\[ \limsup_{n\to\infty} x_n(\omega) < \infty \qquad \text{or} \qquad \liminf_{n\to\infty} x_n(\omega) > -\infty. \]

Then

\[ P\{\lim_{n\to\infty} x_n(\omega) = \infty\} = P\{\lim_{n\to\infty} x_n(\omega) = -\infty\} = 0. \]

It follows that the series in the corollary must converge and diverge together with probability 1, as was to be proved.

COROLLARY 2 Let $M_1, M_2, \cdots$ be measurable $\omega$ sets, and let $p_j$ be the conditional probability of $M_j$ relative to $M_1, \cdots, M_{j-1}$, that is, relative to the field generated by the latter sets. Then the set of points in infinitely many $M_j$'s, and the set of points of divergence of the series $\sum_{1}^{\infty} p_j(\omega)$, differ by at most a set of probability 0.

This corollary is sometimes phrased in a more intuitive fashion as follows. Neglecting zero probabilities, infinitely many events of a given sequence $E_1, E_2, \cdots$ occur if and only if the series of conditional probabilities of the $E_j$'s relative to their predecessors diverges. (Note that these conditional probabilities are random variables, not necessarily constants.)

This corollary is a generalization of the Borel-Cantelli lemma (III, Theorem 1.2), obtained as a special case of Corollary 1 by setting $y_j(\omega) = 1$ or 0 according as $\omega$ is or is not in $M_j$.

The following theorem is the semi-martingale analogue of Theorem 4.1. 

THEOREM 4.1s Let $\{x_n, \mathscr{F}_n, n \ge 1\}$ be a semi-martingale, and let $\mathscr{F}_\infty$ be the smallest Borel field of $\omega$ sets with $\mathscr{F}_\infty \supset \bigcup_1^\infty \mathscr{F}_n$. Then, by Theorem 3.1 (i),
$$E\{x_1\} \le E\{x_2\} \le \cdots.$$

(i) If $\operatorname*{L.U.B.}_{n} E\{|x_n|\} < \infty$, then $\lim_{n\to\infty} x_n = x_\infty$ exists with probability 1, and $E\{|x_\infty|\} < \infty$. In particular, if the $x_n$'s are non-positive, $E\{|x_n|\}$ is monotone non-increasing, so that this condition is always satisfied; if the $x_n$'s are non-negative, this condition reduces to $\lim_{n\to\infty} E\{x_n\} = K < \infty$. In the latter case $E\{x_\infty\} \le K$.

(ii) (a) If the $x_n$'s are uniformly integrable, then
$$\operatorname*{L.U.B.}_{n} E\{|x_n|\} < \infty, \qquad \lim_{n\to\infty} E\{|x_\infty - x_n|\} = 0,$$
and the process $\{x_n, \mathscr{F}_n, 1 \le n \le \infty\}$ is a semi-martingale which is dominated by a semi-martingale relative to the same fields.




§4 CONVERGENCE THEOREMS 325 


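(b) If $\operatorname*{L.U.B.}_{n} E\{|x_n|\} < \infty$, so that $x_\infty$ exists, and if the process $\{x_n, 1 \le n \le \infty\}$ is a semi-martingale, then

(4.7) $$\lim_{n\to\infty} E\{x_n\} \le E\{x_\infty\},$$

and there is equality if and only if the $x_n$'s are uniformly integrable. In particular, there is always equality if the $x_n$'s are non-negative.

(iii) Suppose that the $x_n$'s are non-negative. Then
$$E\{x_1^\alpha\} \le E\{x_2^\alpha\} \le \cdots, \qquad \alpha \ge 1.$$
If, for some $\alpha > 1$, $\lim_{n\to\infty} E\{x_n^\alpha\} < \infty$, then the $x_n$'s are uniformly integrable, and
$$E\{x_\infty^\alpha\} < \infty, \qquad \lim_{n\to\infty} E\{|x_\infty - x_n|^\alpha\} = 0.$$
Conversely, if the $x_n$'s are uniformly integrable, and if $E\{x_\infty^\alpha\} < \infty$ for some $\alpha > 1$, then
$$\lim_{n\to\infty} E\{x_n^\alpha\} = E\{x_\infty^\alpha\}.$$

(iv) If

(4.8) $$E\{\operatorname*{L.U.B.}_{n \ge 1}\, [x_{n+1} - E\{x_{n+1} \mid \mathscr{F}_n\}]\} < \infty,$$

then $\lim_{n\to\infty} x_n(\omega)$ exists and is finite for almost all $\omega$ for which
$$\limsup_{n\to\infty} x_n(\omega) < \infty.$$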
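(v) If

(4.9) $$E\{\operatorname*{L.U.B.}_{n \ge 1}\, [x_{n+1} - E\{x_{n+1} \mid \mathscr{F}_n\}]^2\} < \infty,$$

then $\lim_{n\to\infty} x_n(\omega)$ exists and is finite for almost all $\omega$ for which
$$\sum_1^\infty [E\{x_{n+1}^2 \mid \mathscr{F}_n\} - E\{x_{n+1} \mid \mathscr{F}_n\}^2] < \infty, \qquad \sum_1^\infty [E\{x_{n+1} \mid \mathscr{F}_n\} - x_n] < \infty,$$
and conversely.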

Proof of (i) The method of proof of Theorem 4.1 (i) [which is a special 
case of Theorem 4.1s (i)] is applicable without change to the proof of the 
existence of the limit x, in the present semi-martingale case. However, 
we give an instructive alternative proof which reduces the desired result 
to that in the martingale case. According to Theorem 1.2 (ii), the 
hypothesis of (i) implies that we can write $x_n$ in the form

(4.10) $$x_n = x_n' + \sum_1^n \Delta_j$$

where $\{x_n', \mathscr{F}_n, n \ge 1\}$ is a martingale, the $\Delta_j$'s are non-negative, $E\{|x_n'|\}$ is bounded in $n$, $\sum_1^\infty \Delta_j < \infty$ with probability 1, and $\sum_1^\infty E\{\Delta_j\} < \infty$. Then, according to Theorem 4.1 (i), $\lim_{n\to\infty} x_n' = x_\infty'$ exists and is finite with probability 1, so that
$$\lim_{n\to\infty} x_n = x_\infty' + \sum_1^\infty \Delta_j = x_\infty$$


with probability 1. The statements of (i) are trivial consequences of 
what we have now obtained. 

Proof of (ii) If the hypothesis of (i) is strengthened to the hypothesis of (ii) (a) that the $x_n$'s are uniformly integrable, then we show that the process $\{x_n, \mathscr{F}_n, 1 \le n \le \infty\}$ is a semi-martingale by showing that
$$x_n \le E\{x_\infty \mid \mathscr{F}_n\}, \qquad n \ge 1,$$
with probability 1. This can be shown directly, as in the corresponding treatment of the martingale case. There is some interest in the following alternative method, however. According to Theorem 1.2 (iii) the $x_n'$'s are uniformly integrable in this case, and it then follows from Theorem 4.1 (ii) that $\{x_n', \mathscr{F}_n, 1 \le n \le \infty\}$ is a martingale, so that
$$E\{x_\infty \mid \mathscr{F}_n\} = E\{x_\infty' + \sum_1^\infty \Delta_j \mid \mathscr{F}_n\} \ge x_n' + \sum_1^n \Delta_j = x_n$$


with probability 1, as was to be proved. Moreover, if we define $x_n^{+}$ as the right side of the inequality
$$|x_n| \le |x_n'| + \sum_1^\infty \Delta_j, \qquad 1 \le n \le \infty,$$
the $x_n$ process is dominated by the semi-martingale $\{x_n^{+}, \mathscr{F}_n, 1 \le n \le \infty\}$.
This result completes that of Theorem 1.2 (iv). Going in the other direction, assume the hypotheses of (ii) (b). Then (4.7) is implied by the semi-martingale inequality. Moreover, since the process of double the positive parts of the $x_n$'s, $\{|x_n| + x_n, 1 \le n \le \infty\}$, is a semi-martingale, its random variables are uniformly integrable, by Theorem 3.1 (iii). Hence
$$\lim_{n\to\infty} E\{|x_n| + x_n\} = E\{|x_\infty| + x_\infty\}.$$
(In particular, there is equality in (4.7) if the $x_n$'s are non-negative.) On the other hand, by Fatou's lemma,
$$\lim_{n\to\infty} E\{|x_n| - x_n\} \ge E\{|x_\infty| - x_\infty\}.$$
Combining these two relations, we find again that (4.7) is true, and that there is equality in (4.7) if and only if there is equality in the application of Fatou's lemma above, that is, if and only if the negative parts of the $x_n$'s are uniformly integrable. Since, as we have already remarked, the positive parts of the $x_n$'s are always uniformly integrable, under the present hypotheses, it follows that there is equality in (4.7) if and only if the $x_n$'s are uniformly integrable.

Proof of (iii) If the $x_n$'s are non-negative, the $x_n^\alpha$'s constitute a semi-martingale for any $\alpha \ge 1$ for which these random variables have finite expectations. Hence $E\{x_n^\alpha\}$ is non-decreasing in $n$. If
$$\lim_{n\to\infty} E\{x_n^\alpha\} < \infty,$$
for some $\alpha > 1$, the $x_n$'s are uniformly integrable; this inference has nothing to do with martingale theory. Then, by (ii), $x_\infty$ exists and $\{x_n, \mathscr{F}_n, 1 \le n \le \infty\}$ is a semi-martingale. By Fatou's lemma, $E\{x_\infty^\alpha\} < \infty$. Hence, by Theorem 1.1 (i), $\{x_n^\alpha, 1 \le n \le \infty\}$ is a semi-martingale also. Since the random variables of this semi-martingale are non-negative, and since there is a last one, the random variables are uniformly integrable, by Theorem 3.1 (iii). Then integration to the limit yields
$$\operatorname*{L.U.B.}_{n} E\{x_n^\alpha\} = \lim_{n\to\infty} E\{x_n^\alpha\} = E\{x_\infty^\alpha\} < \infty,$$
$$\lim_{n\to\infty} E\{|x_\infty - x_n|^\alpha\} = 0.$$
To prove the converse half of (iii) we need only remark that, if $\{x_n, 1 \le n \le \infty\}$ is a semi-martingale with non-negative random variables, and if $E\{x_\infty^\alpha\} < \infty$, then the process $\{x_n^\alpha, 1 \le n \le \infty\}$ is also a semi-martingale, by Theorem 1.1 (i), so that, according to the semi-martingale inequality,
$$E\{x_n^\alpha\} \le E\{x_\infty^\alpha\}, \qquad n \ge 1.$$
We have already seen that there is equality in the limit when $n \to \infty$ since the $x_n^\alpha$'s are non-negative.

Proofs of (iv) and (v) In the notation of the representation (4.10),
$$x_{n+1} - E\{x_{n+1} \mid \mathscr{F}_n\} = x_{n+1}' - x_n',$$
$$E\{x_{n+1}^2 \mid \mathscr{F}_n\} - E\{x_{n+1} \mid \mathscr{F}_n\}^2 = E\{(x_{n+1}' - x_n')^2 \mid \mathscr{F}_n\},$$
$$E\{x_{n+1} \mid \mathscr{F}_n\} - x_n = \Delta_{n+1}.$$
With the help of these relations, (iv) and (v) are reduced to the corresponding martingale statements of Theorem 4.1 (iv) and (v), as applied to the $x_n'$ process. Note that, if $E\{\operatorname*{L.U.B.}_{n \ge 1} (x_{n+1} - x_n)\} < \infty$, then condition (4.8) is satisfied.




We remark that both in this theorem and in Theorem 4.1 the $\alpha$th power,
$$\Phi(s) = s^\alpha,$$
has been used ($\alpha > 1$), but this was only in view of certain applications, and any function $\Phi$ which is convex and monotone non-decreasing for $s \ge 0$, with
$$\lim_{s\to\infty} \frac{\Phi(s)}{s} = \infty,$$
would serve as well.
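THEOREM 4.2 Let $\{x_n, \mathscr{F}_n, n \le -1\}$ be a martingale, and define
$$\mathscr{F}_{-\infty} = \bigcap_{n \le -1} \mathscr{F}_n.$$
Then $\lim_{n\to-\infty} x_n = x_{-\infty}$ exists and is finite with probability 1, and $\{x_n, \mathscr{F}_n, -\infty \le n \le -1\}$ is a martingale. The $x_n$'s are uniformly integrable, and

(4.11) $$E\{|x_{-\infty}|\} = \lim_{n\to-\infty} E\{|x_n|\} \le \cdots \le E\{|x_{-2}|\} \le E\{|x_{-1}|\}.$$

If, for some $\alpha \ge 1$, $E\{|x_{-1}|^\alpha\} < \infty$, then
$$\lim_{n\to-\infty} E\{|x_{-\infty} - x_n|^\alpha\} = 0.$$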


This is the strict sense version of IV, Theorem 7.3. It is sufficient to prove the existence of the limit $x_{-\infty}$ in the real case, and we accordingly assume for the moment that the $x_n$'s are real. Let $x_*$, $x^*$ be respectively the inferior and superior limits of the sequence $\{x_n\}$. Fix $r_1$ and $r_2$ and let $\beta_m(\omega)$ be the number of upcrossings of the interval $[r_1, r_2]$ by $x_m(\omega), \cdots, x_{-1}(\omega)$. Then $\beta_m(\omega) \to \infty$ monotonely when $m \to -\infty$, on every $\omega$ for which $x^*(\omega) > r_2 > r_1 > x_*(\omega)$, although according to Theorem 3.3
$$E\{\beta_m\} \le \frac{E\{|x_{-1}|\} + |r_1|}{r_2 - r_1}.$$
Then
$$P\{x^*(\omega) > r_2 > r_1 > x_*(\omega)\} = 0,$$
for every pair $r_1, r_2$, and as in the proof of Theorem 4.1 (i) it follows that $x_* = x^*$ with probability 1, that is, $\lim_{n\to-\infty} x_n = x_{-\infty}$ exists as a finite or infinite limit. The process $\{|x_n|^\alpha, -\infty < n \le -1\}$ is a semi-martingale, for any $\alpha \ge 1$ for which
$$E\{|x_{-1}|^\alpha\} < \infty.$$
Since this semi-martingale has non-negative random variables, these random variables are uniformly integrable, according to Theorem 3.1 (iii). It follows ($\alpha = 1$) that $x_{-\infty}$ is finite-valued with probability 1, that
$$\lim_{n\to-\infty} E\{|x_n|\} = E\{|x_{-\infty}|\}, \qquad \lim_{n\to-\infty} E\{|x_{-\infty} - x_n|\} = 0,$$
and that the same is true for the exponent $\alpha > 1$ if the relevant expectations exist. To identify $x_{-\infty}$ with $E\{x_{-1} \mid \mathscr{F}_{-\infty}\}$ we must prove that, if $\Lambda \in \mathscr{F}_{-\infty}$, then
$$\int_\Lambda x_{-\infty}\, dP = \int_\Lambda x_{-1}\, dP.$$
Now, because of the martingale equality and the fact that $\Lambda \in \mathscr{F}_n$ for every $n$,
$$\int_\Lambda x_n\, dP = \int_\Lambda x_{-1}\, dP, \qquad n \le -1,$$
and when $n \to -\infty$ this yields the preceding equality (because of the uniform integrability of the $x_n$'s).

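THEOREM 4.2s Let $\{x_n, \mathscr{F}_n, n \le -1\}$ be a semi-martingale. Then $\lim_{n\to-\infty} x_n = x_{-\infty}$ exists with probability 1, and $-\infty \le x_{-\infty} < \infty$ with probability 1. Define
$$\mathscr{F}_{-\infty} = \bigcap_{n \le -1} \mathscr{F}_n.$$
Then
$$E\{x_{-1}\} \ge E\{x_{-2}\} \ge \cdots,$$
by Theorem 3.1 (ii). If
$$\lim_{n\to-\infty} E\{x_n\} = K > -\infty,$$
then $x_{-\infty}$ is finite with probability 1, and
$$\lim_{n\to-\infty} E\{x_n\} = K = E\{x_{-\infty}\}, \qquad \lim_{n\to-\infty} E\{|x_{-\infty} - x_n|\} = 0.$$
Moreover, $\{x_n, \mathscr{F}_n, -\infty \le n \le -1\}$ is a semi-martingale whose random variables are uniformly integrable. If the $x_n$'s are non-negative, and if, for some $\alpha \ge 1$, $E\{x_{-1}^\alpha\} < \infty$, then
$$\lim_{n\to-\infty} E\{|x_{-\infty} - x_n|^\alpha\} = 0.$$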
The method of proof of Theorem 4.2 is applicable without essential change to the present more general theorem. If $x_n = n$, $K = -\infty$ and $x_{-\infty} = -\infty$ with probability 1. However, Theorem 3.2 is easily applied to deduce the fact that $\operatorname*{L.U.B.}_{n} x_n < \infty$ with probability 1, so that the same is true of $x_{-\infty}$. The following proof for the case $K > -\infty$ is illuminating in that it illustrates the close connection between martingales and semi-martingales.


If $\lim_{n\to-\infty} E\{x_n\}$ is finite, the series of non-negative terms
$$\sum_{j=-\infty}^{-1} [E\{x_j \mid \mathscr{F}_{j-1}\} - x_{j-1}]$$
converges with probability 1, because the expectations of the partial sums are dominated by
$$E\{x_{-1}\} - \lim_{n\to-\infty} E\{x_n\}.$$
Then we define $x_n'$ by

(4.12) $$x_n' = x_n - \sum_{j=-\infty}^{n} [E\{x_j \mid \mathscr{F}_{j-1}\} - x_{j-1}],$$

following (1.5'). The process $\{x_n', \mathscr{F}_n, n \le -1\}$ is a martingale, and therefore $\lim_{n\to-\infty} x_n' = x_{-\infty}'$ exists and is finite with probability 1, and the process $\{x_n', \mathscr{F}_n, -\infty \le n \le -1\}$ is a martingale, by Theorem 4.2. Then
$$\lim_{n\to-\infty} x_n = x_{-\infty}' = x_{-\infty}$$
with probability 1. To prove that the process $\{x_n, \mathscr{F}_n, -\infty \le n \le -1\}$ is a semi-martingale we have only to prove that
$$x_{-\infty} \le E\{x_n \mid \mathscr{F}_{-\infty}\}, \qquad n \le -1,$$
with probability 1, and we have from (4.12), since the summands are non-negative,
$$E\{x_n \mid \mathscr{F}_{-\infty}\} \ge E\{x_n' \mid \mathscr{F}_{-\infty}\} = x_{-\infty}' = x_{-\infty}.$$
The $x_n$'s are uniformly integrable because the $x_n'$'s are [by Theorem 3.1 (iii)], and the sums in (4.12) are because they are dominated by the infinite sum. [Alternatively the $x_n$'s are uniformly integrable by Theorem 3.1 (iv).] Finally, if the $x_n$'s are non-negative, and if
$$E\{x_{-1}^\alpha\} < \infty$$
for some $\alpha \ge 1$, then the $x_n^\alpha$ process is a semi-martingale, with non-negative random variables and a last random variable, so that the $x_n^\alpha$'s are uniformly integrable, by Theorem 3.1 (iii), and (integration to the limit)
$$\lim_{n\to-\infty} E\{|x_{-\infty} - x_n|^\alpha\} = 0.$$




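THEOREM 4.3 Let $z$ be a random variable with $E\{|z|\} < \infty$, and let $\cdots \subset \mathscr{F}_0 \subset \mathscr{F}_1 \subset \cdots$ be Borel fields of measurable $\omega$ sets. Let $\mathscr{F}_{-\infty} = \bigcap_{-\infty}^{\infty} \mathscr{F}_n$ and let $\mathscr{F}_\infty$ be the smallest Borel field of sets with $\mathscr{F}_\infty \supset \bigcup_{-\infty}^{\infty} \mathscr{F}_n$. Then

(4.13) $$\lim_{n\to-\infty} E\{z \mid \mathscr{F}_n\} = E\{z \mid \mathscr{F}_{-\infty}\}, \qquad \lim_{n\to\infty} E\{z \mid \mathscr{F}_n\} = E\{z \mid \mathscr{F}_\infty\}$$

with probability 1.

Define
$$x_n = E\{z \mid \mathscr{F}_n\}, \qquad -\infty \le n \le \infty.$$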


The process $\{x_n, \mathscr{F}_n, -\infty \le n \le \infty\}$ is a martingale because $\mathscr{F}_n$ is non-decreasing in $n$ (see Example 1, §1), so the first equation in (4.13) is simply an application of Theorem 4.2. The process $\{|x_n|, -\infty \le n \le \infty\}$ is a semi-martingale with non-negative random variables and a last random variable, and its random variables are therefore uniformly integrable, by Theorem 3.1 (iii). Then, by Theorem 4.1 (ii), $\lim_{n\to\infty} x_n = y$ exists with probability 1, and the process $\{x_n, \mathscr{F}_n, 1 \le n \le \infty\}$ is a martingale if $x_\infty$ is replaced by $y$. To identify $y$ with $x_\infty$, and thus prove the second equation of (4.13), we note that $x_\infty$ is a conditional expectation, characterized (neglecting values on an $\omega$ set of probability 0) by two conditions: it is equal almost everywhere to a random variable measurable with respect to $\mathscr{F}_\infty$, and it has the same integral as $z$ on every $\mathscr{F}_\infty$ set. Now, if $\Lambda \in \mathscr{F}_n$, the martingale equality applied twice yields
$$\int_\Lambda y\, dP = \int_\Lambda x_n\, dP = \int_\Lambda z\, dP.$$
Since the extreme terms are equal for $\Lambda \in \mathscr{F}_n$, and therefore for $\Lambda \in \bigcup_n \mathscr{F}_n$, they are equal for $\Lambda \in \mathscr{F}_\infty$. In fact, $\mathscr{F}_\infty$ is by definition the Borel field generated by the field $\bigcup_n \mathscr{F}_n$, and the extreme terms of the above equality define completely additive functions of $\mathscr{F}_\infty$ sets which are identical on the field $\bigcup_n \mathscr{F}_n$ and therefore are identical on $\mathscr{F}_\infty$ (see Supplement, Theorem 2.1). But then $y$ satisfies the conditions characterizing $x_\infty$, so they are equal with probability 1, as was to be proved.

We remark that, if $\mathscr{F}_n$ is only defined for sufficiently large or sufficiently small $n$, the relevant half of Theorem 4.3 remains applicable.

The following corollary covers the most important case of Theorem 4.3.




COROLLARY 1 Let $z$ be any random variable with $E\{|z|\} < \infty$, and let $y_1, y_2, \cdots$ be any random variables. Then, if $\mathscr{G}_n$ is the Borel field of the $\omega$ sets determined by conditions on the $y_j$'s with $j \ge n$,

(4.13') $$\lim_{n\to\infty} E\{z \mid y_n, y_{n+1}, \cdots\} = E\{z \mid \bigcap_1^\infty \mathscr{G}_n\}, \qquad \lim_{n\to\infty} E\{z \mid y_1, \cdots, y_n\} = E\{z \mid y_1, y_2, \cdots\}$$

with probability 1. In particular, if $z$ is a random variable on the sample space of the $y_j$'s, the second limit can be replaced by $z$.

To reduce the first of these equations to the first in (4.13) identify $\mathscr{G}_n$ with $\mathscr{F}_{-n}$. To reduce the second to the second in (4.13) identify with $\mathscr{F}_n$ the Borel field of $\omega$ sets determined by conditions on the $y_j$'s for $j \le n$. In particular, if $z$ is a random variable on the sample space of the $y_j$'s, the second limit is $z$ itself, with probability 1, by definition of conditional expectation.

The following theorem is an immediate consequence of Theorem 4.3, 
but its form makes it more useful in studying certain problems. 

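THEOREM 4.4 (i) Suppose that the random variables $z, \cdots, x_{-2}, x_{-1}$ constitute a martingale relative to the respective fields $\mathscr{F}, \cdots, \mathscr{F}_{-2}, \mathscr{F}_{-1}$, and define $\mathscr{F}_{-\infty} = \bigcap_{n \le -1} \mathscr{F}_n$. Then
$$\lim_{n\to-\infty} x_n = E\{x_{-1} \mid \mathscr{F}_{-\infty}\}$$
with probability 1, and the random variables
$$z,\ E\{x_{-1} \mid \mathscr{F}_{-\infty}\},\ \cdots,\ x_{-2},\ x_{-1}$$
constitute a martingale relative to the respective Borel fields
$$\mathscr{F},\ \mathscr{F}_{-\infty},\ \cdots,\ \mathscr{F}_{-2},\ \mathscr{F}_{-1}.$$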


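(ii) Suppose that the random variables $x_1, x_2, \cdots, z$ constitute a martingale relative to the respective Borel fields $\mathscr{F}_1, \mathscr{F}_2, \cdots, \mathscr{F}$, and let $\mathscr{F}_\infty$ be the smallest Borel field of $\omega$ sets with $\mathscr{F}_\infty \supset \bigcup_1^\infty \mathscr{F}_n$. Then
$$\lim_{n\to\infty} x_n = E\{z \mid \mathscr{F}_\infty\}$$
with probability 1, and the random variables
$$x_1,\ x_2,\ \cdots,\ E\{z \mid \mathscr{F}_\infty\},\ z$$
constitute a martingale relative to the respective Borel fields
$$\mathscr{F}_1,\ \mathscr{F}_2,\ \cdots,\ \mathscr{F}_\infty,\ \mathscr{F}.$$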




This theorem shows how to enlarge the parameter set of certain types 
of martingales; the discussion of this problem will be taken up again in 
a later section. The semi-martingale version of the theorem is the 
following. The details of the proof will be given in the semi-martingale 
case, since some are not obvious. 

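THEOREM 4.4s (i) Suppose that the random variables $z, \cdots, x_{-2}, x_{-1}$ constitute a semi-martingale relative to the respective Borel fields $\mathscr{F}, \cdots, \mathscr{F}_{-2}, \mathscr{F}_{-1}$, and define $\mathscr{F}_{-\infty} = \bigcap_{n \le -1} \mathscr{F}_n$. Then
$$\lim_{n\to-\infty} x_n = x_{-\infty}$$
exists and is finite with probability 1, and the random variables $z, x_{-\infty}, \cdots, x_{-2}, x_{-1}$ constitute a semi-martingale relative to the respective Borel fields $\mathscr{F}, \mathscr{F}_{-\infty}, \cdots, \mathscr{F}_{-2}, \mathscr{F}_{-1}$.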

(ii) Suppose that the random variables $x_1, x_2, \cdots, z$ constitute a semi-martingale relative to the respective Borel fields $\mathscr{F}_1, \mathscr{F}_2, \cdots, \mathscr{F}$, and let $\mathscr{F}_\infty$ be the smallest Borel field of $\omega$ sets with $\mathscr{F}_\infty \supset \bigcup_1^\infty \mathscr{F}_n$. Then $\lim_{n\to\infty} x_n = x_\infty$ exists and is finite with probability 1, and

(4.14) $$\lim_{n\to\infty} E\{x_n\} \le E\{x_\infty\} \le E\{z\}.$$

The random variables $x_1, x_2, \cdots, x_\infty, z$ constitute a semi-martingale (and, if so, necessarily one relative to the respective Borel fields $\mathscr{F}_1, \mathscr{F}_2, \cdots, \mathscr{F}_\infty, \mathscr{F}$) if and only if the first two members of this continued inequality are equal, or equivalently if and only if the $x_n$'s are uniformly integrable. The stated conditions will be satisfied, for example, if the $x_n$'s are non-negative, or more generally if the $x_1, x_2, \cdots, z$ process is dominated by a semi-martingale.

In connection with the last point we remark that, if the $x_n$'s are non-negative, we can assume that $z$ is non-negative also, since the positive parts of the $x_1, x_2, \cdots, z$ process also constitute a semi-martingale.

Proof of (i) According to Theorem 4.1s the limit $x_{-\infty}$ exists as stated in (i), and the random variables $x_{-\infty}, \cdots, x_{-2}, x_{-1}$ constitute a semi-martingale relative to the indicated fields. There only remains the proof that
$$z \le E\{x_{-\infty} \mid \mathscr{F}\}$$
with probability 1, that is,
$$\int_\Lambda z\, dP \le \int_\Lambda x_{-\infty}\, dP, \qquad \Lambda \in \mathscr{F}.$$
Now this semi-martingale inequality is true by hypothesis if $x_{-\infty}$ is replaced by $x_n$, and when $n \to -\infty$ we obtain the desired inequality, because
$$\lim_{n\to-\infty} E\{|x_{-\infty} - x_n|\} = 0$$
by Theorem 4.1s.
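Proof of (ii) Under the hypotheses of (ii),
$$\operatorname*{L.U.B.}_{n} E\{|x_n|\} < \infty,$$
according to Theorem 3.1 (ii). It follows that $x_\infty$ exists. The rest of (ii) then follows easily from Theorem 4.1s (ii), except for the inequality relating the last two members of (4.14). The proof of the inequality is given as follows. Let $K$ be any real number, and define
$$\Phi(s) = \operatorname{Max}(0, s).$$
Then by Theorem 1.1 (i) the random variables
$$\Phi(x_1 - K),\ \Phi(x_2 - K),\ \cdots,\ \Phi(z - K)$$
constitute a semi-martingale. Hence
$$E\{\Phi(x_n - K)\} \le E\{\Phi(z - K)\},$$
so that when $n \to \infty$ we obtain, using Fatou's lemma,
$$E\{\Phi(x_\infty - K)\} \le E\{\Phi(z - K)\},$$
that is,
$$\int_{\{x_\infty(\omega) > K\}} (x_\infty - K)\, dP \le \int_{\{z(\omega) > K\}} (z - K)\, dP.$$
Then, if $K < 0$,
$$\int_{\{x_\infty(\omega) > K\}} x_\infty\, dP \le \int_{\{z(\omega) > K\}} z\, dP + K P\{x_\infty(\omega) > K\} - K P\{z(\omega) > K\}$$
$$\le \int_{\{z(\omega) > K\}} z\, dP - K P\{x_\infty(\omega) \le K\}$$
$$\le \int_{\{z(\omega) > K\}} z\, dP + \int_{\{x_\infty(\omega) \le K\}} |x_\infty|\, dP,$$
and this inequality becomes the desired one when $K \to -\infty$.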


5. Application to the theory of sums of independent random variables 
Although the zero-one law is easily proved directly (see III, Theorem 1.1), it is instructive to derive it from martingale theory. Suppose then


§5 SUMS OF INDEPENDENT RANDOM VARIABLES 335 


that $y_1, y_2, \cdots$ are mutually independent random variables. The zero-one law states that, if $x$ is for every $n$ a random variable on the sample space of $y_n, y_{n+1}, \cdots$, then $x = \text{const.}$ with probability 1. To derive this result from martingale theory, assume first that $E\{|x|\} < \infty$. Then, since for every $n$ the random variable $x$ is independent of the random variables $y_1, \cdots, y_n$, it follows that
$$E\{x \mid y_1, \cdots, y_n\} = E\{x\}$$
with probability 1. When $n \to \infty$ this becomes, according to Theorem 4.3, Corollary 1,
$$x = E\{x\}$$
with probability 1, and this is the desired result. If $E\{x\}$ does not exist, define $x_N(\omega)$ as $x(\omega)$ if $|x(\omega)| \le N$ and as 0 otherwise. Then the result just obtained implies that $x_N = E\{x_N\}$ with probability 1. This is impossible for all $N$ unless $x = \text{const.}$ with probability 1.

Martingale theory can be used to lay the basis for the study of the convergence of a series $\sum_1^\infty y_j$ of mutually independent random variables. Rather than showing how this can be done in detail we prove a fundamental theorem on convergence, by martingale methods, and then go on to apply Theorem 4.1 to the partial sums of the series $\sum_1^\infty y_j$. In the following we shall suppose for simplicity that the $y_j$'s are real.

Let $y_n$ have characteristic function $\Phi_n$, and suppose that for some $\lambda$ no $\Phi_n(\lambda)$ vanishes. Then it is easily verified that, using this $\lambda$, the $z_n$ process defined by
$$z_n = \frac{e^{i\lambda \sum_1^n y_j}}{\prod_1^n \Phi_j(\lambda)}$$
is a martingale. It was shown in III §2 that, if the infinite product $\prod_1^\infty \Phi_j(\lambda)$ is convergent on a $\lambda$ set of positive Lebesgue measure, then the series $\sum_1^\infty y_j$ converges with probability 1. This result, which implies that convergence in measure or in the mean of the series $\sum_1^\infty y_j$ implies convergence with probability 1 (see Corollary 2 to III, Theorem 2.7), will now be derived by an analysis of the $z_n$ process. Dropping some of the first terms of the series $\sum_1^\infty y_j$ if necessary, we can suppose that


$|\prod_1^\infty \Phi_j(\lambda)| \ge \tfrac{1}{2}$ on a $\lambda$ set $A$ of positive Lebesgue measure. Then, if $\lambda \in A$, $|z_n(\omega)| \le 2$, and therefore $\lim_{n\to\infty} z_n$ exists with probability 1, by Theorem 4.1. Consequently
$$\lim_{n\to\infty} e^{i\lambda \sum_1^n y_j}$$
exists with probability 1 for each $\lambda$ in $A$. Then, except for an $\omega$ set of probability 0,
$$\lim_{n\to\infty} e^{i\lambda \sum_1^n y_j(\omega)} = f(\lambda, \omega)$$
exists for almost all $\lambda$ in $A$ (Fubini's theorem). To finish the proof we show that, if $\omega_0$ is not in the exceptional set, $\sum_1^\infty y_j(\omega_0)$ converges. To show this let $A_1$ be any Lebesgue measurable subset of $A$ of finite positive measure. Then
$$\lim_{n\to\infty} \int_{A_1} e^{i\lambda \sum_1^n y_j(\omega_0)}\, d\lambda = \int_{A_1} f(\lambda, \omega_0)\, d\lambda.$$
It follows that, if $\limsup_{n\to\infty} |\sum_1^n y_j(\omega_0)| = \infty$, then the right side must vanish (for every $A_1$) by the classical Lebesgue theorem on trigonometric integrals. But then $f(\lambda, \omega_0) = 0$ for almost all $\lambda \in A$, and this contradicts the obvious fact that $|f| = 1$. Then $\limsup_{n\to\infty} |\sum_1^n y_j(\omega_0)| < \infty$. If $s_1$ and $s_2$ are unequal limiting values of the partial sums of the series $\sum_1^\infty y_j(\omega_0)$ it follows that $s_1$ and $s_2$ are finite and
$$e^{i\lambda s_1} = e^{i\lambda s_2}$$
for almost all $\lambda \in A$. But this equality is impossible for two values of $\lambda$ whose ratio is irrational. Hence there is only a single limiting value of the partial sums of the series $\sum_1^\infty y_j(\omega_0)$, that is, the series converges, as was to be proved.
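The martingale property of the normalized exponential can be verified numerically in a simple special case. The sketch below assumes, purely for illustration, that $y_j = \pm a_j$ with probability $1/2$ each, so that $\Phi_j(\lambda) = \cos(\lambda a_j)$; the step sizes $a_j = 1/j^2$, the value $\lambda = 0.7$, and all function names are hypothetical choices, not part of the text.

```python
import cmath
import itertools
import math

def z_value(lam, steps, signs):
    """z_n = exp(i*lam*(y_1+...+y_n)) / prod_j Phi_j(lam), with y_j = signs[j]*steps[j]
    and Phi_j(lam) = cos(lam*steps[j]) (characteristic function of a fair +-steps[j])."""
    n = len(signs)
    partial_sum = sum(s * a for s, a in zip(signs, steps[:n]))
    denom = math.prod(math.cos(lam * a) for a in steps[:n])
    return cmath.exp(1j * lam * partial_sum) / denom

def conditional_mean_of_next(lam, steps, signs):
    """E{z_{n+1} | y_1, ..., y_n}: average over the two equally likely next signs."""
    return sum(z_value(lam, steps, signs + (s,)) for s in (1, -1)) / 2

def max_martingale_defect(lam, steps):
    """Largest |E{z_{n+1} | past} - z_n| over every sign history shorter than len(steps)."""
    worst = 0.0
    for n in range(len(steps)):
        for signs in itertools.product((1, -1), repeat=n):
            gap = conditional_mean_of_next(lam, steps, signs) - z_value(lam, steps, signs)
            worst = max(worst, abs(gap))
    return worst

if __name__ == "__main__":
    steps = [1 / (j * j) for j in range(1, 9)]   # assumed step sizes a_j = 1/j^2
    print(max_martingale_defect(0.7, steps))     # differs from 0 only by rounding error
```

The check works because averaging $e^{\pm i\lambda a}$ gives $\cos(\lambda a)$, which cancels the new factor in the denominator.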
Next we apply Theorem 4.1 to the $x_n$ process defined by $x_n = \sum_1^n y_j$, assuming that $E\{y_j\} = 0$ for $j > 1$, so that the $x_n$ process is a martingale. By Theorem 4.1 (i),
$$E\{|x_1|\} \le E\{|x_2|\} \le \cdots,$$
$\sum_1^\infty y_j$ converges with probability 1 if $\lim_{n\to\infty} E\{|x_n|\} = K < \infty$, and in that case $E\{|x_\infty|\} \le K$. The following theorem provides a converse to this result which is not true for all martingales.
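THEOREM 5.1 Let $y_1, y_2, \cdots$ be mutually independent real random variables with $E\{y_j\} = 0$ for $j > 1$, and suppose that $\sum_1^\infty y_j$ converges with probability 1 to a sum $x_\infty$, with $E\{|x_\infty|\} < \infty$. Then $E\{x_\infty\} = E\{y_1\}$ and, if $x_n = \sum_1^n y_j$,

(5.1) $$E\{\operatorname*{L.U.B.}_{n} |x_n|^\alpha\} \le 8E\{|x_\infty|^\alpha\}, \qquad \alpha \ge 1.$$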

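The right side may of course be infinite if $\alpha > 1$. If $\alpha > 1$, Theorem 3.4 is applicable, and implies (5.1) for $\alpha \ge 1.3$. We therefore need only prove (5.1) for $\alpha < 1.3$, but this restriction on $\alpha$ will not be made until it is stated explicitly. First suppose that the $y_j$'s are symmetrically distributed. Then by a trivial extension of III, Theorem 2.2,
$$P\{\operatorname*{L.U.B.}_{n} x_n(\omega) \ge \lambda\} \le 2P\{x_\infty(\omega) \ge \lambda\} = P\{|x_\infty(\omega)| \ge \lambda\}, \qquad \lambda > 0,$$
and it follows that
$$E\{\operatorname*{L.U.B.}_{n} |x_n|^\alpha\} = \int_0^\infty P\{\operatorname*{L.U.B.}_{n} |x_n(\omega)|^\alpha \ge \lambda\}\, d\lambda \le 2\int_0^\infty P\{|x_\infty(\omega)|^\alpha \ge \lambda\}\, d\lambda = 2E\{|x_\infty|^\alpha\}.$$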


In the unsymmetric case let $y_1^*, y_2^*, \cdots$ be random variables independent of each other and of the $y_j$'s, and let $y_n^*$ have the same distribution as $y_n$ for every $n$. Let $x_n^* = \sum_1^n y_j^*$, $x_\infty^* = \sum_1^\infty y_j^*$. Since the $(y_j - y_j^*)$'s are symmetrically distributed, we have

(5.2) $$E\{\operatorname*{L.U.B.}_{n} |x_n - x_n^*|^\alpha\} \le 2E\{|x_\infty - x_\infty^*|^\alpha\} \le 2^{\alpha+1} E\{|x_\infty|^\alpha\}.$$

Next we show that

(5.3) $$E\{x_\infty \mid x_1, \cdots, x_n\} = x_n.$$

In fact,

(5.4) $$E\{x_\infty \mid x_1, \cdots, x_n\} = E\{x_\infty - x_n \mid x_1, \cdots, x_n\} + x_n = E\{x_\infty - x_n\} + x_n = E\{x_\infty\} - E\{y_1\} + x_n,$$

and $E\{x_\infty\} = E\{y_1\}$ because, when $n \to \infty$, $x_n$ and the left side of the equation go to $x_\infty$ (by Theorem 4.3, Corollary 1). Thus (5.3) is true and


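(Example 1 of §1) the process $\{x_n, 1 \le n \le \infty\}$ is a martingale. If $\alpha \ge 1$ we have, from Theorem 4.1 (iii),
$$E\{|x_1|^\alpha\} \le E\{|x_2|^\alpha\} \le \cdots, \qquad \lim_{n\to\infty} E\{|x_n|^\alpha\} = E\{|x_\infty|^\alpha\}.$$
Now, if $\alpha \ge 1$,
$$|x_n|^\alpha \le 2^{\alpha-1} \operatorname*{L.U.B.}_{m} |x_m - x_m^*|^\alpha + 2^{\alpha-1} |x_n^*|^\alpha.$$
Hence
$$|x_n|^\alpha \le 2^{\alpha-1} E\{\operatorname*{L.U.B.}_{m} |x_m - x_m^*|^\alpha \mid x_1, x_2, \cdots\} + 2^{\alpha-1} E\{|x_n|^\alpha\}$$
so that, using (5.2),

(5.5) $$E\{\operatorname*{L.U.B.}_{n} |x_n|^\alpha\} \le 2^{\alpha-1} E\{\operatorname*{L.U.B.}_{m} |x_m - x_m^*|^\alpha\} + 2^{\alpha-1} \operatorname*{L.U.B.}_{m} E\{|x_m|^\alpha\}$$
$$\le 2^{2\alpha} E\{|x_\infty|^\alpha\} + 2^{\alpha-1} E\{|x_\infty|^\alpha\}$$
$$\le 8E\{|x_\infty|^\alpha\}, \qquad \alpha \le 1.3.$$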
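The symmetric-case inequality $P\{\text{L.U.B.}\, x_n \ge \lambda\} \le 2P\{x_\infty \ge \lambda\}$ can be checked exactly for a finite sum. The sketch below is a hedged illustration under assumptions not made in the text: the $y_j$ are fair $\pm 1$ steps, the sum is truncated at $N = 10$ terms (so $x_N$ plays the role of $x_\infty$), and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import accumulate, product

def tail_probabilities(N, level):
    """Return (P{max_{n<=N} x_n >= level}, P{x_N >= level}) exactly, for
    x_n = y_1 + ... + y_n with independent fair +-1 steps (an assumed example)."""
    hits_max = 0
    hits_end = 0
    for signs in product((1, -1), repeat=N):
        path = list(accumulate(signs))       # partial sums x_1, ..., x_N
        hits_max += max(path) >= level       # the running maximum reaches the level
        hits_end += path[-1] >= level        # the final sum reaches the level
    total = 2 ** N
    return Fraction(hits_max, total), Fraction(hits_end, total)

if __name__ == "__main__":
    for level in range(1, 6):
        p_max, p_end = tail_probabilities(10, level)
        print(level, p_max, 2 * p_end, p_max <= 2 * p_end)
```

All $2^{10}$ sign patterns are enumerated, so the comparison is exact rather than a Monte Carlo estimate.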
Theorem 4.1 gave general convergence criteria for martingales $\{x_n, n \ge 1\}$. In particular, suppose that $x_n = \sum_1^n y_j$, where the $y_j$'s are mutually independent, and $E\{y_j\} = 0$, $j > 1$. Then we have two theorems which allow the strengthening of Theorem 4.1: The zero-one law (III, Theorem 1.1) implies that the sequence $\{x_n\}$ converges to a finite limit with probability 0 or 1; Theorem 5.1 states that in case there is convergence with probability 1, to the limit $x_\infty$, then, if $E\{|x_\infty|\} < \infty$, it follows that $E\{\operatorname*{L.U.B.}_{n} |x_n|\} < \infty$. This means among other things that parts (i) and (ii) of Theorem 4.1 coalesce in this special case. The strengthened version of Theorem 4.1 in this special case is the following:

Let $y_1, y_2, \cdots$ be mutually independent random variables, with
$$E\{|y_j|\} < \infty, \quad j \ge 1; \qquad E\{y_j\} = 0, \quad j > 1.$$
Then, if $x_n = \sum_1^n y_j$,
$$E\{|x_1|^\alpha\} \le E\{|x_2|^\alpha\} \le \cdots, \qquad \alpha \ge 1.$$

(i) If $\lim_{n\to\infty} E\{|x_n|^\alpha\} < \infty$ for some $\alpha \ge 1$, then $\lim_{n\to\infty} x_n = x_\infty$ exists with probability 1, $E\{\operatorname*{L.U.B.}_{n} |x_n|^\alpha\} < \infty$, $\lim_{n\to\infty} E\{|x_\infty - x_n|^\alpha\} = 0$, and the sequence $\{x_n, n \le \infty\}$ is a martingale. Conversely, if $\lim_{n\to\infty} x_n = x_\infty$ exists with probability 1, and if, for some $\alpha \ge 1$, $E\{|x_\infty|^\alpha\} < \infty$, then $\lim_{n\to\infty} E\{|x_n|^\alpha\} < \infty$.
no 




(ii) If the $y_j$'s are real, and if
$$E\{\operatorname*{L.U.B.}_{n} y_n\} < \infty,$$
then $\lim_{n\to\infty} x_n$ exists and is finite with probability 1 if $\limsup_{n\to\infty} x_n < \infty$ with positive probability.

(iii) If

(5.6) $$E\{\operatorname*{L.U.B.}_{n} |y_n|^2\} < \infty,$$

then $\lim_{n\to\infty} x_n$ exists and is finite with probability 1 if and only if
$$\sum_1^\infty E\{|y_n|^2\} < \infty.$$
1 


Parts (i) and (ii) need no further comment. Part (iii) states that if (5.6) is true then $\sum_1^\infty y_j$ converges with probability 1 if and only if there is convergence in the mean. The “if” half of this statement is not interesting, because according to the Corollary to III, Theorem 2.7, convergence in the mean of any series of mutually independent random variables implies convergence with probability 1. The “only if” half generalizes III, Theorem 2.4, which replaces (5.6) by the stronger hypothesis that the $y_j$'s are uniformly bounded.

Before continuing, we remark that, if $z_1$ and $z_2$ are mutually independent random variables, then, if $E\{|z_1 + z_2|\} < \infty$, it follows that $E\{|z_1|\}$ and $E\{|z_2|\}$ are also finite. It is sufficient to prove that $E\{|z_1|\} < \infty$, and this follows from the inequality
$$E\{|z_1 + z_2|\} \ge \int_{\{|z_2(\omega)| \le a\}} (|z_1| - a)\, dP = P\{|z_2(\omega)| \le a\}\, E\{|z_1| - a\}.$$

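THEOREM 5.2 Let $y_1, y_2, \cdots$ be mutually independent random variables, and suppose that $\sum_1^\infty y_j$ converges with probability 1 to a sum $x_\infty$ with $E\{|x_\infty|^\alpha\} < \infty$ for some $\alpha \ge 1$. Then $E\{|y_j|^\alpha\} < \infty$ for all $j$,
$$E\{x_\infty\} = \sum_1^\infty E\{y_j\},$$

(5.7) $$E\{\operatorname*{L.U.B.}_{n} |\sum_1^n y_j|^\alpha\} \le 8 \cdot 2^{2\alpha-1} E\{|x_\infty|^\alpha\} + 2^{\alpha-1} \operatorname*{L.U.B.}_{n} |\sum_1^n E\{y_j\}|^\alpha,$$

and

(5.8) $$\lim_{n\to\infty} E\{|x_\infty - \sum_1^n y_j|^\alpha\} = 0.$$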




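The last statement, that the partial sums converge in the mean with index $\alpha$, follows from the fact that according to (5.7) the quantity in the brace in (5.8) is dominated by a function with a finite expectation, and (5.8) will accordingly not be mentioned further. Since the $y_j$'s are mutually independent, and since $x_\infty = y_n + \sum_{j \ne n} y_j$ with $E\{|x_\infty|\} < \infty$, it follows that $E\{|y_n|\} < \infty$ also. Let $x_n = \sum_1^n y_j$. Then (5.4) becomes
$$E\{x_\infty \mid x_1, \cdots, x_n\} = E\{x_\infty\} - E\{x_n\} + x_n.$$
We now find when $n \to \infty$ that
$$E\{x_\infty\} = \lim_{n\to\infty} E\{x_n\}.$$
It follows that
$$\sum_1^\infty [y_j - E\{y_j\}] = x_\infty - E\{x_\infty\}.$$
Then, by Theorem 5.1,
$$E\{\operatorname*{L.U.B.}_{n} |\sum_1^n [y_j - E\{y_j\}]|^\alpha\} \le 8E\{|x_\infty - E\{x_\infty\}|^\alpha\}$$
so that
$$E\{\operatorname*{L.U.B.}_{n} |\sum_1^n y_j|^\alpha\} \le 8 \cdot 2^{\alpha-1} E\{|x_\infty - E\{x_\infty\}|^\alpha\} + 2^{\alpha-1} \operatorname*{L.U.B.}_{n} |\sum_1^n E\{y_j\}|^\alpha$$
$$\le 8 \cdot 2^{2\alpha-1} E\{|x_\infty|^\alpha\} + 2^{\alpha-1} \operatorname*{L.U.B.}_{n} |\sum_1^n E\{y_j\}|^\alpha.$$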


This completes our discussion of the application of martingale theory 
to the study of series of mutually independent random variables. It is 
significant that, even though the general martingale convergence theorems 
can be strengthened when martingales defined by the partial sums of 
mutually independent random variables are studied, martingale theory is 
useful in this very strengthening. 

The main stress in this section has been on the application of Theorem 4.1. Many of the results could have been obtained, however, by applying Theorem 4.2 to the martingale with random variables $\cdots, z_{-2}, z_{-1}$, where
$$z_{-n} = E\{\sum_1^\infty y_j \mid y_n, y_{n+1}, \cdots\} = \sum_n^\infty y_j + E\{\sum_1^{n-1} y_j\}.$$
(Here we are assuming that $\sum_1^\infty y_j$ converges with probability 1 and that the expectation of the sum exists.)


§6 THE STRONG LAW OF LARGE NUMBERS 341 


6. Application to the strong law of large numbers 


Let $y_1', y_2', \cdots$ be mutually independent random variables with a common distribution function, and suppose that $E\{|y_1'|\} < \infty$. Then

(6.1) $$\lim_{n\to\infty} \frac{y_1' + \cdots + y_n'}{n} = E\{y_1'\}$$

with probability 1. This theorem was proved in III (Theorem 5.1) as an application of a theorem on the convergence of an infinite series of mutually independent random variables. It can also be deduced as a special case of the strong law of large numbers for strictly stationary processes (X, Theorem 2.1). The wide sense version of (6.1) is the following. Let $y_1', y_2', \cdots$ be an orthonormal sequence of random variables. Then

(6.1') $$\operatorname*{l.i.m.}_{n\to\infty} \frac{y_1' + \cdots + y_n'}{n} = 0.$$

The wide sense version is trivial because
$$E\left\{\left|\frac{y_1' + \cdots + y_n'}{n}\right|^2\right\} = \frac{1}{n} \to 0 \qquad (n \to \infty).$$

In IV §7 a proof of (6.1') was given as an application of the convergence theorems for wide sense martingales. We now show how this proof can be translated into a proof of (6.1) by replacing the wide sense concepts by strict sense concepts.

According to Theorem 4.3, Corollary 1, if $y_n = (y_1' + \cdots + y_n')/n$,
$$\lim_{n\to\infty} E\{y_1' \mid y_n, y_{n+1}, \cdots\} = x_{-\infty}$$
exists with probability 1. Now it is clear that
$$E\{y_1' \mid y_n, y_{n+1}, \cdots\} = E\{y_1' \mid y_n, y_{n+1}', y_{n+2}', \cdots\}$$
and, since $y_{n+1}', y_{n+2}', \cdots$ are independent of the pair $y_1', y_n$,
$$E\{y_1' \mid y_n, y_{n+1}, \cdots\} = E\{y_1' \mid y_n\}$$
with probability 1. Thus
$$\lim_{n\to\infty} E\{y_1' \mid y_n\} = x_{-\infty}$$
with probability 1.


Finally, by symmetry,
$$E\{y_1' \mid y_1' + \cdots + y_n'\} = E\{y_j' \mid y_1' + \cdots + y_n'\}, \qquad j \le n,$$
$$= \frac{1}{n} \sum_{j=1}^n E\{y_j' \mid y_1' + \cdots + y_n'\}$$
$$= \frac{1}{n} E\{\sum_{j=1}^n y_j' \mid y_1' + \cdots + y_n'\}$$
$$= y_n,$$
so that (6.1) is proved except for the identification of the limit. To identify the limit we note that in the first place $x_{-\infty}$ is unaffected by changes in any finite number of $y_j'$'s and therefore, by the zero-one law (III, Theorem 1.1), $x_{-\infty} = \text{const.} = E\{x_{-\infty}\}$ with probability 1. In the second place according to Theorem 4.4 the sequence
$$x_{-\infty},\ \cdots,\ E\{y_1' \mid y_2, y_3, \cdots\},\ E\{y_1' \mid y_1, y_2, \cdots\}$$
is a martingale and therefore $E\{x_{-\infty}\} = E\{y_1'\}$.
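The symmetry step, $E\{y_1' \mid y_1' + \cdots + y_n'\} = (y_1' + \cdots + y_n')/n$, can be confirmed by exhaustive enumeration in a small discrete case. The following sketch assumes, for illustration only, that the $y_j'$ are three independent rolls of a fair die; the function name and that choice of distribution are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def conditional_mean_of_first(values, n):
    """For n independent draws, each uniform on `values`, return the dictionary
    {s: E{y_1' | y_1' + ... + y_n' = s}} computed exactly by enumeration."""
    total_first = defaultdict(Fraction)   # sum of first coordinates over outcomes with sum s
    count = defaultdict(int)              # number of equally likely outcomes with sum s
    for outcome in product(values, repeat=n):
        s = sum(outcome)
        total_first[s] += outcome[0]
        count[s] += 1
    return {s: total_first[s] / count[s] for s in count}

if __name__ == "__main__":
    n = 3
    table = conditional_mean_of_first(range(1, 7), n)
    for s in sorted(table):
        # Second and third columns agree: the conditional mean equals s / n.
        print(s, table[s], Fraction(s, n))
```

Because the outcomes are equally likely and exchangeable, each coordinate carries the same share of the conditioned sum, which is the content of the symmetry argument.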

It would be interesting to generalize the above proof to obtain a 
martingale proof of the general case of the strong law of large numbers 
for strictly stationary processes (X, Theorem 2.1, ergodic theorem), but 
no such proof has ever been given. Wiener has pointed out that the 
ergodic theorem is really an integration theorem, closely related to the 
fundamental theorem of the calculus. Martingale theorems are also 
essentially integration theorems, although in a somewhat different context. 
In fact, the equation
$$E\{x_n \mid x_1, \cdots, x_{n-1}\} = x_{n-1},$$
whose validity with probability 1 is the defining property of a martingale $\{x_n, n \ge 1\}$, shows that $x_{n-1}$ is obtained from $x_n$ by integrating out one variable. More insight in this direction will be given in the following section.


7. Application to integration in infinitely many dimensions 


The significance of the martingale convergence theorems as integration theorems is exhibited very clearly in the following examples, first discussed (from another point of view) by Jessen. Let the basic $\omega$ space be the space of infinite sequences $(\eta_1, \eta_2, \cdots)$, $0 \le \eta_j \le 1$, and let the given probability measure be infinite dimensional Lebesgue measure. Then, if $y_j$ is the $j$th coordinate variable, $y_1, y_2, \cdots$ are mutually independent


§8 THE THEORY OF DERIVATIVES 343 

random variables, each uniformly distributed over the interval $[0, 1]$.

Let $z$ be a random variable with $E\{|z|\} < \infty$ and consider the integrals

(7.1) $$x_n(\omega) = \int_0^1 \int_0^1 \cdots\, z(\eta_1, \eta_2, \cdots)\, d\eta_{n+1}\, d\eta_{n+2} \cdots = E\{z \mid y_1, \cdots, y_n\},$$
$$x_{-n}(\omega) = \int_0^1 \cdots \int_0^1 z(\eta_1, \eta_2, \cdots)\, d\eta_1 \cdots d\eta_{n-1} = E\{z \mid y_n, y_{n+1}, \cdots\}.$$

The integral forms make it intuitively reasonable that

(7.2) $$\lim_{n\to-\infty} x_n = E\{z\}, \qquad \lim_{n\to\infty} x_n = z$$

with probability 1. Actually the expressions for the $x_n$'s as conditional expectations show that the sequences $\cdots, x_{-2}, x_{-1}$ and $x_1, x_2, \cdots$ are martingales. Hence (Theorem 4.2) $\lim_{n\to-\infty} x_n = x_{-\infty}$ with probability 1 and (Theorem 4.3, Corollary 1) $\lim_{n\to\infty} x_n = z$ with probability 1. It remains to identify $x_{-\infty}$ with $E\{z\}$. Since $x_{-\infty}$ is independent of $y_1, \cdots, y_n$ for every $n$, $x_{-\infty} = E\{x_{-\infty}\}$, with probability 1 (zero-one law), and by Theorem 4.2 $E\{x_{-\infty}\} = E\{z\}$, so that $x_{-\infty} = E\{z\}$ with probability 1.

Note that the results (and proofs) are valid whenever the $y_j$'s are mutually independent; the hypothesis of a common distribution function (with constant density) only served to simplify the integration notation.

8. Application to the theory of derivatives 

Let $\Omega$ be an abstract space, and let $P\{\cdot\}$ be a probability measure of $\omega$
sets, as usual. For each $n$ let $M_0^{(n)}, M_1^{(n)}, \cdots$ be finitely or denumerably
infinitely many disjunct measurable $\omega$ sets, with union $\Omega$. Let $\mathcal{F}_n$ be
the Borel field of $\omega$ sets which are unions of $M_j^{(n)}$'s (fixed $n$). The class
$\bigcup_n \mathcal{F}_n$ is a field. Let $\mathcal{F}_\infty$ be the Borel field generated by this field. We
suppose that each $M_j^{(n+1)}$ is a subset of some $M_k^{(n)}$. Then

$$\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}_\infty.$$

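Let $\varphi$ be any function of sets of the field $\bigcup_n \mathcal{F}_n$, completely additive on
$\mathcal{F}_n$ for each $n$, and define the $\omega$ function $x_n$ by

$$x_n(\omega) = \begin{cases}
\dfrac{\varphi(M_j^{(n)})}{P\{M_j^{(n)}\}}, & \omega \in M_j^{(n)}, \text{ if } P\{M_j^{(n)}\} > 0, \\[2ex]
0, & \omega \in M_j^{(n)}, \text{ if } P\{M_j^{(n)}\} = 0.
\end{cases}$$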


Then, if $m < n$, it is trivial to verify that

$$E\{x_n \mid \mathcal{F}_m\} = x_m$$

with probability 1, so that $\{x_n, \mathcal{F}_n, n \ge 1\}$ is a martingale. The present
section is devoted to various applications of this fact. Note that the
results are not really limited to cases where the basic measure is a probability
measure, since any finite-valued measure (not identically 0) can be
reduced to a probability measure by the introduction of a suitable
proportionality factor.

In the following we shall say that we are in the Lebesgue case if $\Omega$ is
the linear interval $[0, 1]$, if the measurable $\omega$ sets are the Lebesgue
measurable subsets of $[0, 1]$, and if $P$ measure is Lebesgue measure.

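Example 1 (absolutely continuous $\varphi$) Suppose that $\varphi$ is absolutely
continuous, that is, that there is an $\mathcal{F}_\infty$ measurable $\omega$ function $x_\infty$, with
$E\{|x_\infty|\} < \infty$, such that

$$\varphi(\Lambda) = \int_\Lambda x_\infty\, dP, \qquad \Lambda \in \mathcal{F}_\infty.$$

Then it is trivial to verify that

$$E\{x_\infty \mid \mathcal{F}_n\} = x_n$$

with probability 1. Then in this case the random variables $x_1, x_2, \cdots, x_\infty$
form a martingale, and, according to Theorem 4.3,

$$\lim_{n\to\infty} x_n = E\{x_\infty \mid \mathcal{F}_\infty\} = x_\infty \tag{8.2}$$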
with probability 1. In this case, according to Theorem 3.1 applied to
the semi-martingale $|x_1|, |x_2|, \cdots, |x_\infty|$, the $x_n$'s are uniformly integrable.
Conversely, if the $x_n$'s are uniformly integrable, then $\lim_{n\to\infty} x_n = x_\infty$ exists
with probability 1, $\lim_{n\to\infty} E\{|x_\infty - x_n|\} = 0$, and the process $\{x_n, 1 \le n \le \infty\}$
is a martingale, by Theorem 4.1 (ii). But then, if we define $\varphi_1$ by

$$\varphi_1(\Lambda) = \int_\Lambda x_\infty\, dP,$$

we have (martingale property)

$$\varphi(\Lambda) = \int_\Lambda x_n\, dP, \qquad \Lambda \in \bigcup_m \mathcal{F}_m,$$

for $n$ so large that $\Lambda \in \mathcal{F}_n$, so that when $n \to \infty$ we find that

$$\varphi(\Lambda) = \varphi_1(\Lambda), \qquad \Lambda \in \bigcup_m \mathcal{F}_m.$$

This equality is therefore true for $\Lambda \in \mathcal{F}_\infty$ (see Supplement, Theorem 2.1).
We have thus shown that the $x_n$'s are uniformly integrable if and only if
$\varphi$ is absolutely continuous (relative to $\mathcal{F}_\infty$). If $\varphi$ is defined on the field
$\mathcal{F}$ of all measurable $\omega$ sets, and is absolutely continuous relative to $\mathcal{F}$,
with density function $X$ $[x_\infty]$ relative to $\mathcal{F}$ $[\mathcal{F}_\infty]$, then

$$E\{X \mid \mathcal{F}_\infty\} = x_\infty, \qquad E\{X \mid \mathcal{F}_n\} = E\{x_\infty \mid \mathcal{F}_n\} = x_n,$$

with probability 1, and the above argument remains valid in every detail.
Finally, $x_\infty = X$ with probability 1 if and only if $X$ is equal almost everywhere
to a function measurable with respect to $\mathcal{F}_\infty$. For example, in
the Lebesgue case if the $M_j^{(n)}$'s are intervals, and if $\delta_n$ is the least upper
bound of the lengths of $M_0^{(n)}, M_1^{(n)}, \cdots$, then, if $\lim_{n\to\infty} \delta_n = 0$, it is
clear that $\mathcal{F}_\infty$ contains every subinterval of $[0, 1]$, and therefore every
Borel subset of this interval. It follows that in this case $x_\infty = X$ with
probability 1.

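Example 2 (singular $\varphi$) Suppose that $\varphi$ is singular, that is, we suppose
that $\varphi$ is defined on the sets of $\mathcal{F}_\infty$, that $\varphi$ is completely additive, and
that there is an $M \in \mathcal{F}_\infty$, the singular set of $\varphi$, such that $\varphi(\Lambda) \ne 0$ for some
$\Lambda \subset M$, $\Lambda \in \mathcal{F}_\infty$, and that

$$P\{M\} = 0, \qquad \varphi(\Lambda) = 0, \quad \Lambda \subset \Omega - M \quad (\Lambda \in \mathcal{F}_\infty).$$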


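Suppose first that $\varphi$ is non-negative. Then the $x_n$'s are non-negative; so,
by Theorem 4.1 (i), $\lim_{n\to\infty} x_n = x_\infty$ exists and is finite with probability 1.
By Fatou's lemma,

$$\int_\Lambda x_\infty\, dP \le \liminf_{n\to\infty} \int_\Lambda x_n\, dP,$$

if $\Lambda$ is measurable. On the other hand,

$$\varphi(\Lambda) = \int_\Lambda x_n\, dP, \qquad \Lambda \in \mathcal{F}_m, \quad m \le n.$$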
A 

Then combining these two relations we find that

$$\int_\Lambda x_\infty\, dP \le \varphi(\Lambda), \qquad \Lambda \in \bigcup_n \mathcal{F}_n.$$

Since $\mathcal{F}_\infty$ is the Borel field generated by the field $\bigcup_n \mathcal{F}_n$, this inequality
must be true for $\Lambda \in \mathcal{F}_\infty$, since it is true for $\Lambda \in \bigcup_n \mathcal{F}_n$ (see Supplement,
Theorem 2.1). But then, if $M$ is the singular set of $\varphi$,

$$\int_{\Omega - M} x_\infty\, dP \le \varphi(\Omega - M) = 0,$$

so that (since $x_\infty \ge 0$) $x_\infty = 0$ with probability 1. If $\varphi$ is not necessarily
non-negative, it can be expressed as the difference between two non-negative
singular functions so that again we deduce that $\lim_{n\to\infty} x_n = 0$ with
probability 1.

Example 3 ($\varphi$ of bounded variation) Suppose that $\varphi$ is defined on the
$\mathcal{F}_\infty$ sets and is completely additive. We assume, as we always do in
this book in considering set functions, that $\varphi$ is finite-valued. Then
according to a standard theorem of the theory of set functions $\varphi$ is of
bounded variation. That is, there is a $K$ such that, for any disjunct $\mathcal{F}_\infty$
sets $\Lambda_1, \Lambda_2, \cdots$,

$$\sum_j |\varphi(\Lambda_j)| \le K.$$

Then

$$E\{|x_n|\} = \sum_j |\varphi(M_j^{(n)})| \le K,$$

so that, by Theorem 4.1 (i), $\lim_{n\to\infty} x_n = x_\infty$ exists and is finite with
probability 1. The function $x_\infty$ is called the derivative of $\varphi$ with respect
to the given probability measure, relative to the net of the $M_j^{(n)}$'s, and
we have thus proved that every completely additive set function has a
derivative with respect to a given measure, relative to a given net. Since
$\varphi$ is the sum of an absolutely continuous and a singular function, relative
to $\mathcal{F}_\infty$, the results of the preceding examples identify the derivative $x_\infty$
as the density of the absolutely continuous component of $\varphi$ relative to
$\mathcal{F}_\infty$. If $\varphi$ is defined on $\mathcal{F}$, this derivative is the density of the absolutely
continuous component of $\varphi$ relative to $\mathcal{F}$ if and only if the net is fine
enough, more precisely if and only if that density is equal almost everywhere
to a function measurable with respect to $\mathcal{F}_\infty$ and if the singular
component of $\varphi$ relative to $\mathcal{F}$ has an $\mathcal{F}_\infty$ set as singular set.

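Example 4 (Lebesgue case) In the Lebesgue case, for each $n$ let
$0 = \xi_0^{(n)} < \cdots < \xi_{a_n}^{(n)} = 1$, where each $\xi_j^{(n)}$ is a $\xi_k^{(n+1)}$, and define

$$M_j^{(n)} = \begin{cases}
[0, \xi_1^{(n)}], & j = 0, \\
(\xi_j^{(n)}, \xi_{j+1}^{(n)}], & j > 0.
\end{cases}$$

Let $F$ be a numerically valued real function on $[0, 1]$, and define

$$\varphi\Bigl(\bigcup_j M_j^{(n)}\Bigr) = \sum_j \bigl[F(\xi_{j+1}^{(n)}) - F(\xi_j^{(n)})\bigr].$$

Then the definition of $x_n$ becomes

$$x_n(\xi) = \frac{F(\xi_{j+1}^{(n)}) - F(\xi_j^{(n)})}{\xi_{j+1}^{(n)} - \xi_j^{(n)}}, \qquad \xi \in M_j^{(n)}.$$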


We suppose in the following that

$$\lim_{n\to\infty} \max_j\, (\xi_{j+1}^{(n)} - \xi_j^{(n)}) = 0,$$

so that $\mathcal{F}_\infty$ is the class of Borel subsets of $[0, 1]$. Suppose that $F$ is a
function of bounded variation, and assume as known the fact that the
derivative $F'$ exists almost everywhere, that the absolutely continuous
component $F_1$ of $F$ is the integral of $F'$,

$$F = F_1 + F_2, \qquad F_1(\xi) = \int_0^\xi F'(\eta)\, d\eta,$$

and that the singular component $F_2$ of $F$ has derivative 0 almost everywhere.
Then, according to Example 3,

$$\lim_{n\to\infty} x_n = F'$$

with probability 1. On the other hand, the results of Example 3 can also
be used to derive the above-stated facts about the existence and significance
of $F'$. We omit a discussion of this point. Now let $R$ be the set of
$\xi_j^{(n)}$'s, $j, n \ge 0$. According to Example 1, if $\varphi$ is absolutely continuous
(which is equivalent to the condition that $F$ coincides on $R$ with an
absolutely continuous function defined on $[0, 1]$), the $x_n$'s are uniformly
integrable, and conversely, if the $x_n$'s are uniformly integrable, $F$ coincides
on $R$ with an absolutely continuous function defined on $[0, 1]$. Interesting
pathological examples of martingales can be obtained by choosing $F$ to
be a singular function. The following is such an example, in which $F$
is constant except for a jump at a single point. Let $F$ be defined by


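$$F(\xi) = \begin{cases}
0, & 0 \le \xi < \tfrac12, \\
1, & \tfrac12 \le \xi \le 1,
\end{cases}$$

and set

$$\xi_j^{(n)} = \frac{j}{2^n}, \qquad j = 0, \cdots, 2^n.$$

Then

$$x_n(\xi) = \begin{cases}
0, & 0 \le \xi \le \tfrac12 - \tfrac{1}{2^n}, \\
2^n, & \tfrac12 - \tfrac{1}{2^n} < \xi \le \tfrac12, \\
0, & \tfrac12 < \xi \le 1.
\end{cases}$$

The process $\{x_n, n \ge 1\}$ is a martingale, the $x_n$'s are non-negative,
$E\{x_n\} = 1$, and $\lim_{n\to\infty} x_n = x_\infty = 0$ with probability 1; in fact, $\xi = \tfrac12$ is
the only point where there is not convergence to 0. The process
$\{x_n, 1 \le n \le \infty\}$ is not a martingale, since $E\{x_n\} \ne E\{x_\infty\}$. This is the
first example we have given of a martingale $x_1, x_2, \cdots$ with limit $x_\infty$, for
which $x_1, x_2, \cdots, x_\infty$ is not also a martingale.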
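The jump example can be checked with exact rational arithmetic. The following is a minimal sketch, not part of the original argument; the function names `F`, `x_n`, `expectation`, and `is_martingale_step` are hypothetical, and the dyadic net $\xi_j^{(n)} = j/2^n$ is the one chosen above.

```python
import math
from fractions import Fraction


def F(xi):
    """Step function with a unit jump at 1/2 (right continuous)."""
    return Fraction(1) if xi >= Fraction(1, 2) else Fraction(0)


def x_n(n, xi):
    """Difference quotient of F over the level-n dyadic cell containing xi.

    Cells are [0, 1/2^n] for j = 0 and (j/2^n, (j+1)/2^n] for j > 0.
    """
    size = 2 ** n
    j = max(math.ceil(xi * size) - 1, 0)
    left, right = Fraction(j, size), Fraction(j + 1, size)
    return (F(right) - F(left)) / (right - left)


def expectation(n):
    """E{x_n} under Lebesgue measure: sum of (cell value) * (cell length)."""
    size = 2 ** n
    return sum(x_n(n, Fraction(j + 1, size)) * Fraction(1, size)
               for j in range(size))


def is_martingale_step(n):
    """Check E{x_{n+1} | F_n} = x_n cell by cell (each cell has two children)."""
    size = 2 ** n
    for j in range(size):
        parent = x_n(n, Fraction(j + 1, size))
        children = [x_n(n + 1, Fraction(2 * j + 1, 2 * size)),
                    x_n(n + 1, Fraction(2 * j + 2, 2 * size))]
        if sum(children) / 2 != parent:
            return False
    return True
```

Evaluating `x_n(n, Fraction(1, 3))` for increasing `n` gives 0 as soon as $\tfrac12 - 2^{-n} \ge \tfrac13$, while `expectation(n)` stays equal to 1, which is the failure of uniform integrability described in the example.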


9. Application to likelihood ratios in statistics


We now consider II §7, Example 3: $y_1, y_2, \cdots$ are random variables;
the distribution of $y_1, \cdots, y_n$ is determined by a Baire density function
$p_n(\cdot, \cdots, \cdot)$, or alternatively by $q_n(\cdot, \cdots, \cdot)$. Define $x_n$ by

$$x_n = \frac{q_n(y_1, \cdots, y_n)}{p_n(y_1, \cdots, y_n)}.$$

Then, if $y_1, y_2, \cdots$ are random variables with distributions determined
by the $p_n$'s, $x_n$ is a random variable, defined with probability 1. We have
seen in II §7 that the sequence $x_1, x_2, \cdots$ is a martingale if
$q_n(\xi_1, \cdots, \xi_n) = 0$ whenever $p_n(\xi_1, \cdots, \xi_n) = 0$. We do not make this hypothesis
here; without it the discussion of II §7 proves that the sequence is a lower
semi-martingale, that is, that $-x_1, -x_2, \cdots$ is a semi-martingale. Since
the $-x_j$'s are non-positive, Theorem 4.1s states that

$$\lim_{n\to\infty} x_n = x_\infty \tag{9.1}$$

exists with probability 1 and

$$1 \ge E\{x_1\} \ge E\{x_2\} \ge \cdots, \qquad E\{x_\infty\} \le \lim_{n\to\infty} E\{x_n\}. \tag{9.2}$$


These two statements can be considered a generalized maximum likelihood
statement. In fact, the general idea of the principle of maximum likelihood
in statistics is that, if $y_1, \cdots, y_n$ probabilities are determined by
the density $p_n(\cdot, \cdots, \cdot)$, then $x_n \le 1$, that is,

$$q_n(y_1, \cdots, y_n) \le p_n(y_1, \cdots, y_n) \tag{9.3}$$

in some average sense. Besides (9.2) the inequality

$$E\{\log x_n\} \le \log E\{x_n\} = \log E\{q_n/p_n\} \le \log 1 = 0 \tag{9.4}$$

is an expression of this fact, and this inequality is fundamental in the
study of the consistency of maximum likelihood estimates in statistics.
The idea of (9.3) is sometimes expressed by the statement that, if
$[y_1(\omega), \cdots, y_n(\omega)]$ is a set of sample values, these actually obtained values
are more probable when calculations are made in terms of the correct
density than when calculations are made in terms of any other.


If proper hypotheses are imposed on the $p_n$'s and $q_n$'s, $x_n$ for large $n$
will exhibit this tendency to be $\le 1$ in a striking way. For example,
suppose that the $p_n$'s and $q_n$'s correspond to independent $y_j$'s with
common distributions, that is,

$$p_n(y_1, \cdots, y_n) = \prod_{j=1}^n p_1(y_j), \qquad q_n(y_1, \cdots, y_n) = \prod_{j=1}^n q_1(y_j), \qquad x_n = \prod_{j=1}^n \frac{q_1(y_j)}{p_1(y_j)}.$$

We prove that in this case

$$\lim_{n\to\infty} x_n = x_\infty = 0$$

with probability 1, unless $p_1(\xi) = q_1(\xi)$ for almost all values of $\xi$ (Lebesgue
measure), that is, unless the $p_1$ and $q_1$ distributions are identical. This is
a result of precisely the type desired.

(a) We prove first that $x_\infty = 0$ with probability 0 or 1. In fact, from
the form of the infinite product defining $x_\infty$ it is clear that, if $x_\infty^*$ has the
same distribution as $x_\infty$ and is independent of $x_\infty$, then $x_\infty x_\infty^*$ also has
the distribution of $x_\infty$. Then

$$P\{x_\infty = 0\} = P\{x_\infty x_\infty^* = 0\} = 1 - (1 - P\{x_\infty = 0\})^2,$$

so that

$$P\{x_\infty = 0\} = 0 \quad \text{or} \quad P\{x_\infty = 0\} = 1.$$

(b) If $x_\infty = 0$ with probability 0, we shall show that $x_\infty = 1$ with
probability 1. In fact, in this case $\log (x_\infty x_\infty^*) = \log x_\infty + \log x_\infty^*$ has
the same distribution as $\log x_\infty$ and $\log x_\infty^*$. Then, if $\Phi$ is their common
characteristic function, $\Phi^2 = \Phi$, which implies that $\Phi(t) \equiv 1$ [since
$\Phi(0) = 1$ and $\Phi$ is continuous], so that $\log x_\infty = 0$ with probability 1.
It follows that $x_\infty = 1$ with probability 1.

(c) Finally it is clear that if $x_\infty = 1$ with probability 1,

$$x_n = \prod_{j=1}^n \frac{q_1(y_j)}{p_1(y_j)} = 1$$

with probability 1 also. Hence $q_1(y_1)/p_1(y_1) = 1$ with probability 1. This
means that $q_1(\xi) = p_1(\xi)$ almost everywhere where $p_1(\xi) > 0$, that is,
almost everywhere on the $\xi$-axis since both functions have the integral 1
over the whole axis.
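The convergence of the likelihood ratio to 0 is easy to observe in a simulation. The sketch below rests on assumptions that are not in the text: $p_1$ is taken to be the standard normal density and $q_1$ the normal density with mean `shift` and unit variance, so that $\log [q_1(y)/p_1(y)] = \text{shift}\cdot y - \text{shift}^2/2$. Working with $\log x_n$ avoids numerical underflow. The names `log_ratio_path` and `last_mistake` are hypothetical.

```python
import math
import random


def log_ratio_path(n, shift=0.5, seed=0):
    """Return [log x_1, ..., log x_n] with the y_j drawn from p_1 = N(0, 1).

    Assumption: q_1 = N(shift, 1), so each factor contributes
    log q_1(y) - log p_1(y) = shift * y - shift**2 / 2.
    """
    rng = random.Random(seed)
    total, path = 0.0, []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        total += shift * y - 0.5 * shift * shift
        path.append(total)
    return path


def last_mistake(path, a=1.0):
    """Last index k (1-based) at which x_k >= a, or 0 if there is none.

    A statistician who accepts p_1 only when x_k < a errs exactly at these k.
    """
    threshold = math.log(a)
    last = 0
    for k, value in enumerate(path, start=1):
        if value >= threshold:
            last = k
    return last
```

With `shift = 0.5` the increments of $\log x_n$ have expectation $-0.125$, in agreement with (9.4), so that $\log x_n$ drifts linearly to $-\infty$ and `last_mistake` returns a finite index along the simulated sequence.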


We have thus verified the general principle underlying the method of
maximum likelihood for mutually independent random variables with a
common distribution function: unless the two distributions are identical,
the likelihood ratio $x_n$ converges to zero with probability 1 when $n \to \infty$
(computing probability by means of the $p_1$ distribution). A statistician
can use this principle as follows. He chooses a positive constant $a$ and
decides that, if a given sample $y_1(\omega), \cdots, y_n(\omega)$ makes $x_n(\omega) < a$, he
will act as if the $p_1$ distribution were the correct one; otherwise he will
act as if the $q_1$ distribution were the correct one. We have just proved
that if $n$ is large the statistician will act incorrectly with small probability,
and in fact that in a sequence of trials there is probability 1 that the
statistician will finally stop making mistakes. (In statistical language, his
procedure is "consistent.") Ordinarily there is a whole family of distributions
to choose from, $q_1(\cdot) = q_1(\theta, \cdot)$ depends on the parameter $\theta$, one
value of which is known to give the correct distribution $q_1(\theta_0, \cdot) = p_1(\cdot)$.
The method of maximum likelihood, given a sample $y_1(\omega), \cdots, y_n(\omega)$,
chooses $\theta_n[y_1(\omega), \cdots, y_n(\omega)]$ as that value $\theta$ maximizing $\prod_{j=1}^n q_1[\theta, y_j(\omega)]$,
if there is one. According to the theorem just proved, as long as $\theta$ is
chosen so that

$$\prod_{j=1}^n q_1[\theta, y_j(\omega)] \ge a \prod_{j=1}^n q_1[\theta_0, y_j(\omega)],$$

where $a$ is a fixed positive constant, we cannot keep on picking the same
incorrect value of $\theta$. Further hypotheses must be imposed, however, to
insure that $\theta_n \to \theta_0$ with probability 1.


10. Application to sequential analysis

Let $y_1, y_2, \cdots$ be mutually independent real random variables, with a
common distribution function, and suppose that $E\{y_1\} = a$ exists. Let $m$
be an integral-valued random variable, with the property that for each $k$
the condition $m(\omega) = k$ is a condition on only the first $k$ $y_j$'s, that is, the
$\omega$ set $\{m(\omega) = k\}$ is determined by conditions on $y_1, \cdots, y_k$. Define $x'$
by

$$x' = \sum_{j=1}^{m} y_j.$$

Then it is important for certain problems in sequential analysis to find
conditions under which

$$E\{x'\} = aE\{m\}. \tag{10.1}$$

This is easily solved by martingale theory. In fact, if $x_n$ is defined by

$$x_n = \sum_{j=1}^n (y_j - a) = \sum_{j=1}^n y_j - na,$$

we have seen that the sequence $x_1, x_2, \cdots$ is a martingale. According to
Theorem 2.2, if $E\{m\} < \infty$, and if for some constant $K$,

$$E\{|x_{n+1} - x_n| \mid x_1, \cdots, x_n\} = E\{|y_{n+1} - a|\} \le K, \qquad n < m(\omega), \tag{10.2}$$

with probability 1, then the "sequence" $x_1, x_m$, obtained by optional
sampling, is a martingale, with $E\{x_1\} = E\{x_m\}$. In the present case (10.2)
is certainly satisfied with $K = E\{|y_1 - a|\}$. The equality $E\{x_1\} = E\{x_m\}$
means that

$$0 = E\{x_m\} = E\{x' - ma\} = E\{x'\} - aE\{m\},$$


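the desired equation. The variance of $x'$ is found as follows. The
sequence

$$\Bigl\{\Bigl[\sum_{j=1}^n y_j - na\Bigr]^2 - n\sigma^2,\ n = 1, 2, \cdots\Bigr\}, \qquad \sigma^2 = E\{(y_1 - a)^2\},$$

is easily seen to be a martingale, if $E\{|y_1|^2\} < \infty$, and in fact

$$E\Bigl\{\Bigl[\sum_{j=1}^{n+1} y_j - (n+1)a\Bigr]^2 - (n+1)\sigma^2 \Bigm| y_1, \cdots, y_n\Bigr\} = \Bigl[\sum_{j=1}^n y_j - na\Bigr]^2 - n\sigma^2$$

with probability 1. Let $\mathcal{F}_n$ be the Borel field of the $\omega$ sets determined
by conditions on $y_1, \cdots, y_n$. Then we have proved that

$$\Bigl\{\Bigl[\sum_{j=1}^n y_j - na\Bigr]^2 - n\sigma^2,\ \mathcal{F}_n,\ n \ge 1\Bigr\}$$

is a martingale. The condition of Theorem 2.2 corresponding to (10.2) becomes, as applied
to this process,

$$E\Bigl\{\Bigl|(y_{n+1} - a)\Bigl[(y_{n+1} - a) + 2\sum_{j=1}^n (y_j - a)\Bigr] - \sigma^2\Bigr| \Bigm| \mathcal{F}_n\Bigr\} \le K, \qquad n < m(\omega).$$

This will surely be true for properly chosen $K$ if $E\{y_1^2\} < \infty$ and if

$$|x_n| = \Bigl|\sum_{j=1}^n (y_j - a)\Bigr| \le \text{const.}, \qquad n < m(\omega).$$

The condition is satisfied, for example, if $m(\omega)$ is defined as the first
integer $j$ with $|x_j(\omega)| \ge K_1$, and if $|y_j| \le K_2$, where $K_1$ and $K_2$ are given
constants.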
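Both the equation (10.1) and the corresponding second moment relation $E\{(x' - ma)^2\} = \sigma^2 E\{m\}$ can be verified exactly in a small case. The sketch below is an illustration under assumptions of our own choosing: the $y_j$ take the values $\pm 1$ with probabilities $p$, $1 - p$, and $m$ is the first time the partial sum leaves the open interval (`lower`, `upper`), truncated at `horizon` so that $m$ is bounded. The function name `stopped_moments` is hypothetical.

```python
from fractions import Fraction


def stopped_moments(p, lower, upper, horizon):
    """Exact E{m}, E{x'}, E{(x' - m a)^2} for a +-1 walk stopped on leaving
    (lower, upper) or at time `horizon`, whichever comes first.

    `alive` maps each partial sum s to P{sum after n steps = s, m > n}.
    """
    a = 2 * p - 1                      # a = E{y_1}
    alive = {0: Fraction(1)}
    e_m = e_x = e_sq = Fraction(0)
    for n in range(1, horizon + 1):
        nxt = {}
        for s, w in alive.items():
            for step, q in ((1, p), (-1, 1 - p)):
                t, wt = s + step, w * q
                if t <= lower or t >= upper or n == horizon:
                    # the event {m = n} depends only on y_1, ..., y_n
                    e_m += wt * n
                    e_x += wt * t
                    e_sq += wt * (t - n * a) ** 2
                else:
                    nxt[t] = nxt.get(t, Fraction(0)) + wt
        alive = nxt
    return e_m, e_x, e_sq
```

For example, with `p = Fraction(3, 5)` one has $a = 1/5$ and $\sigma^2 = 24/25$, and the returned values satisfy `e_x == a * e_m` and `e_sq == sigma2 * e_m` as exact rational identities. Truncating at `horizon` is a convenience that makes the sums finite; it keeps $m$ a legitimate optional sampling time.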

The above type of argument, already applied to find the expectation
and variance of $x'$, can be used to find all the semi-invariants of the $x'$
distribution. Rather than doing this, however, we shall show how the
argument also yields the fundamental theorem of sequential analysis.
Define $\Phi(z)$ by

$$\Phi(z) = E\{e^{z y_1}\}$$

for $z$ complex. Then the sequence $u_1, u_2, \cdots$, where

$$u_n = \frac{e^{z \sum_{j=1}^n y_j}}{\Phi(z)^n},$$

is a martingale, for any value of $z$ for which $\Phi(z)$ exists. In fact, if $n < \nu$,

$$E\{u_\nu \mid \mathcal{F}_n\} = E\{u_\nu \mid y_1, \cdots, y_n\} = \frac{e^{z \sum_{j=1}^n y_j}\, E\{e^{z \sum_{j=n+1}^{\nu} y_j}\}}{\Phi(z)^\nu} = u_n$$

with probability 1. Hence $\{u_n, \mathcal{F}_n, n \ge 1\}$ is a martingale. Then, if
Theorem 2.2 can be applied,

$$1 = E\{u_1\} = E\{u_m\} = E\bigl\{e^{z \sum_{j=1}^m y_j}\, \Phi(z)^{-m}\bigr\},$$


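and this is the fundamental theorem of sequential analysis. The condition
of Theorem 2.2 corresponding to (10.2) becomes here: there is a constant $K$ such that

$$E\{|u_{n+1} - u_n| \mid \mathcal{F}_n\} = \frac{\bigl|e^{z \sum_{j=1}^n y_j}\bigr|}{|\Phi(z)|^n}\, E\Bigl\{\Bigl|\frac{e^{z y_{n+1}}}{\Phi(z)} - 1\Bigr|\Bigr\} \le K, \qquad n < m(\omega),$$

with probability 1, that is,

$$\bigl|e^{z \sum_{j=1}^n y_j}\bigr| \le \text{const.}\, |\Phi(z)|^n, \qquad n < m(\omega),$$

with probability 1. If there is a value of $z$ for which $\Phi(z)$ is defined,
with $|\Phi(z)| \ge 1$, and if the real part of $z \sum_{j=1}^n y_j$ is $\le K_1$ for some $K_1$ when
$n < m(\omega)$, then this condition is satisfied.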

Throughout the above discussion it was supposed that the y,’s have a 
common distribution function, but the methods are obviously applicable 
in the general case. 
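The fundamental theorem of sequential analysis can be checked numerically in the same simple setting. The following sketch again assumes, for illustration only, that the $y_j$ are $\pm 1$ with probabilities $p$, $1 - p$, so that $\Phi(z) = p e^{z} + (1 - p) e^{-z}$, and that $m$ is the first exit time from (`lower`, `upper`) truncated at `horizon`. The function name `fundamental_identity` is hypothetical.

```python
import cmath


def fundamental_identity(z, p=0.6, lower=-3, upper=4, horizon=30):
    """Return E{ exp(z * sum_{j<=m} y_j) * Phi(z)**(-m) } for the stopped walk.

    z may be real or complex; Phi(z) is assumed to be non-zero.
    """
    phi = p * cmath.exp(z) + (1 - p) * cmath.exp(-z)
    alive = {0: 1.0}   # P{partial sum = s, m > n}
    total = 0j
    for n in range(1, horizon + 1):
        nxt = {}
        for s, w in alive.items():
            for step, q in ((1, p), (-1, 1 - p)):
                t, wt = s + step, w * q
                if t <= lower or t >= upper or n == horizon:
                    total += wt * cmath.exp(z * t) / phi ** n
                else:
                    nxt[t] = nxt.get(t, 0.0) + wt
        alive = nxt
    return total
```

Up to rounding error the returned value is 1 for every admissible $z$, for instance `fundamental_identity(0.3)` or `fundamental_identity(0.2 + 0.5j)`. Since $m$ is bounded here, no further hypothesis is needed in this sketch.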


11. Continuous parameter martingales 


In the previous sections we have for the most part restricted our attention
to discrete parameter martingales and semi-martingales, although the
definitions of these types of families of random variables only require that
the parameter range $T$ be an ordered set. In the present section we shall,
as always, suppose that $T$ is a linear point set which may include the
points $\pm\infty$, but shall usually impose no further restriction. The
theorems discussed will include the theorems on sequences of random
variables proved in previous sections as special cases, but the point of the
section is the application to the continuous parameter case, in particular
to the case in which $T$ is an interval, and this explains the section title.

The extensions of the discrete parameter theorems of the previous 
sections will be identified only as the general parameter set versions of the 
corresponding discrete parameter theorems. Some earlier theorems are 
not given general parameter set extensions, because either the extensions 
are uninteresting, unnecessary (because the earlier theorems impose no 
restrictions on the parameter sets, for example Theorems 1.1 and 3.1), or 
unknown. On the latter possibility we remark that it is not known 
under what circumstances the representation (1.5’) of the general random 
variable of a semi-martingale as the sum of the general random variable 
of a martingale and a partial sum of a series of non-negative random 
variables is valid when the parameter set is an interval. 

We defer temporarily the general parameter set versions of the game 
theorems of §2. 

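(THEOREM 3.2) If $\{x_t, t \in T\}$ is a separable semi-martingale, if $\lambda$ is real,
and if $T$ has a maximum value $b$, then

$$\lambda P\Bigl\{\underset{t\in T}{\mathrm{L.U.B.}}\; x_t(\omega) \ge \lambda\Bigr\} \le \int_{\{\mathrm{L.U.B.}_{t\in T}\, x_t(\omega) \ge \lambda\}} x_b\, dP \le E\{|x_b|\},$$

$$\lambda P\Bigl\{\underset{t\in T}{\mathrm{G.L.B.}}\; x_t(\omega) \le \lambda\Bigr\} \ge \int_{\{\mathrm{G.L.B.}_{t\in T}\, x_t(\omega) \le \lambda\}} x_b\, dP - E\{x_b\} + \underset{t\in T}{\mathrm{G.L.B.}}\; E\{x_t\} \ge \underset{t\in T}{\mathrm{G.L.B.}}\; E\{x_t\} - E\{|x_b|\}.$$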


To prove the theorem we remark first that by Theorem 3.2 these
inequalities are true if $T$ is replaced by a finite subset of itself, including
the point $b$. Using the obvious limiting procedure, we then see that the
inequalities are true if $T$ is replaced by an enumerable subset of itself
including the point $b$. Finally, the theorem is true as stated, because by
the definition of a separable stochastic process (see II §2) there is an
enumerable subset $S$ of $T$ with the property that (neglecting an $\omega$ set of
probability 0) a sample function has the same bounds on $T$ as on $S$.
Since the theorem is true when $T$ is replaced by $S$ (which we can assume
contains the point $b$), it is true as stated. Note that the second inequality
will be trivially true if $\mathrm{G.L.B.}_{t\in T}\, E\{x_t\} = -\infty$. If $T$ has a minimum value,
$a$, we have

$$\underset{t\in T}{\mathrm{G.L.B.}}\; E\{x_t\} = E\{x_a\};$$

if $T$ has no minimum value, we have

$$\underset{t\in T}{\mathrm{G.L.B.}}\; E\{x_t\} = \lim_{t\downarrow a} E\{x_t\}, \qquad a = \underset{t\in T}{\mathrm{G.L.B.}}\; t.$$

(THEOREM 3.4) This theorem is generalized exactly as was Theorem 
3.2, and its statement will be omitted. 

(THEOREM 4.1) Part (v) does not seem to have an interesting general
parameter set extension. Part (iv) will be extended later. Parts (i), (ii),
(iii) are easily extended, and we give the extension of (i) to exhibit the new
form of the statement. Let $\{x_t, t \in T\}$ be a martingale. Then $E\{|x_t|\}$ is
monotone non-decreasing in $t$. Suppose that $b = \mathrm{L.U.B.}_{t\in T}\, t \notin T$.

(i) If

$$\lim_{t\uparrow b} E\{|x_t|\} = K < \infty,$$

then there is a random variable $x_{b-}$ with $E\{|x_{b-}|\} \le K$, such that, if $\{s_n\}$ is
a sequence in $T$, $\lim_{n\to\infty} s_n = b$ implies that $\lim_{n\to\infty} x_{s_n} = x_{b-}$ with probability 1.
If the $x_t$ process is separable, this limit relation can be strengthened to the
relation $\lim_{s\uparrow b} x_s = x_{b-}$ with probability 1. In particular, $K < \infty$ if the $x_t$'s
are all real and $\ge 0$ or all real and $\le 0$.

In this statement $b$ may be finite or infinite. To prove the existence of
$x_{b-}$ we note first that, if $K < \infty$, and if $\{s_n\}$ is a sequence of parameter
values converging monotonely to $b$, then the process $\{x_{s_n}, n \ge 1\}$ is a
martingale, with $E\{|x_{s_n}|\} \le K$, so that $\lim_{n\to\infty} x_{s_n}$ exists and is finite with
probability 1 by Theorem 4.1 (i). The limiting random variable must be
independent of the monotone sequence $\{s_n\}$, neglecting values on sets of
probability 0, because any two sequences $\{s_n\}$ can be combined into a
single one which corresponds to a sequence of random variables convergent
with probability 1. Moreover, the limit must also exist with
probability 1 if the sequence $\{s_n\}$ is convergent to $b$, but is not necessarily
monotone, because such a sequence can be reordered to be monotone.
There thus exists an $x_{b-}$, defined in terms of sequential approach to $b$, as
stated. If the $x_t$ process is separable, we have proved (II, Theorem 2.3)
that sequential approach can be replaced by ordinary approach. The
remaining statements in the general parameter set version of Theorem 4.1
(i) are proved in exactly the same way as in the discrete parameter case.

(THEOREM 4.1s) This theorem is generalized just as Theorem 4.1 is, 
and the proof is again carried through by reduction to the discrete 
parameter case. 

(THEOREMS 4.2, 4.2s) The generalizations are carried out in the same 
way as those of Theorems 4.1 and 4.1s. 


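(THEOREM 4.3) Let $z$ be a random variable, with $E\{|z|\} < \infty$, and let
$\mathcal{F}_t$ be a Borel field of measurable $\omega$ sets for each $t$ in the linear set $T$, with
$\mathcal{F}_s \subset \mathcal{F}_t$ for $s < t$. Define

$$a = \underset{t\in T}{\mathrm{G.L.B.}}\; t, \qquad b = \underset{t\in T}{\mathrm{L.U.B.}}\; t, \qquad \mathcal{F}_{a+} = \bigcap_{t\in T} \mathcal{F}_t,$$

and let $\mathcal{F}_{b-}$ be the smallest Borel field of $\omega$ sets with $\mathcal{F}_{b-} \supset \bigcup_{t\in T} \mathcal{F}_t$. Then
the conditional expectation $E\{z \mid \mathcal{F}_t\}$ can be defined for each $t \in T$ in such
a way that

$$\lim_{t\downarrow a} E\{z \mid \mathcal{F}_t\} = E\{z \mid \mathcal{F}_{a+}\}, \qquad \lim_{t\uparrow b} E\{z \mid \mathcal{F}_t\} = E\{z \mid \mathcal{F}_{b-}\}$$

with probability 1.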

To prove this theorem we remark that the limit equations are true, by
Theorem 4.3, if $t$ goes to its limit along a sequence of values. Moreover,
the stochastic process $\{E\{z \mid \mathcal{F}_t\}, t \in T\}$ can be made separable by a proper
choice of the conditional expectations, since each can be changed arbitrarily
on an $\omega$ set of zero probability. If a choice making the process
separable has been made, the sequential approach of $t$ to its limit is no
longer necessary, by II, Theorem 2.3.

The fact that the stochastic process of this theorem can be defined to 
be separable means that the sample function properties of Theorems 11.2 
and 11.5 are true of this process when so defined. 

Example 1 Let $\{x_n, n \ge 0\}$ be a martingale, and define a process
$\{x_t, 0 \le t < \infty\}$ by

$$x_t = x_n, \qquad n \le t < n+1 \quad (n \ge 0).$$

Then the $x_t$ process is a separable martingale. The sample functions have
no discontinuities except at integral values of $t$, but will be discontinuous
at such a value $t = n$ unless $x_n(\omega) = x_{n-1}(\omega)$.

Example 2 Let $\{x_t, 0 \le t < \infty\}$ be a process with independent increments,
with

$$E\{x_t - x_0\} = m(t).$$

Then the process

$$\{x_t - x_0 - m(t),\ 0 \le t < \infty\}$$

is a martingale. This is the type of continuous parameter martingale
which corresponds to the discrete parameter type defined by the partial
sums of a series of mutually independent random variables with zero
expectations. In the particular case of a Poisson process (see II §9 and
VIII §4)

$$m(t) = \text{const.}\; t.$$

Now in this case it is shown in VIII §4 that, if $x_0(\omega) = 0$, $x_t$ process
sample functions can be used to represent the number of events of a certain
type that have occurred between times 0 and $t$: almost all sample functions
are monotone non-decreasing and increase in unit jumps, if the process is
separable. Moreover, for each $t$,

$$\lim_{s\to t} x_s = x_t$$

with probability 1, that is, the jump points vary from sample function to
sample function in such a way that, although the probability of a jump at
any particular value of $t$ is 0, the probability of a jump somewhere in an
interval is positive.
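The centering by $m(t)$ in the Poisson case can be illustrated by simulation. The sketch below checks only one consequence of the martingale property, namely that $x_t - \text{const.}\, t$ has expectation 0 when $x_0 = 0$; it assumes the standard construction of a Poisson process from independent exponential waiting times, and the names `poisson_count` and `mean_compensated` and the parameter values are hypothetical.

```python
import random


def poisson_count(rate, t, rng):
    """Number of unit jumps in [0, t], built from exponential waiting times."""
    count, clock = 0, rng.expovariate(rate)
    while clock <= t:
        count += 1
        clock += rng.expovariate(rate)
    return count


def mean_compensated(rate=2.0, t=3.0, samples=4000, seed=0):
    """Sample mean of x_t - rate * t over independent sample functions."""
    rng = random.Random(seed)
    total = sum(poisson_count(rate, t, rng) - rate * t for _ in range(samples))
    return total / samples
```

With the default values the variance of $x_t$ is 6, so the sample mean of 4000 centered counts has standard deviation about 0.04 and should lie close to 0.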

It will be shown below that, as far as continuity of sample functions is
concerned, the two preceding examples are characteristic of the general
case. Almost all sample functions of a separable martingale are continuous
except for discontinuities at which both left- and right-hand
(finite) limits exist, and there are at most enumerably many parameter
values where the probability of a discontinuity is positive. The Brownian
motion process (II §9 and VIII §2) is a non-trivial example of a separable
martingale (a special case of Example 2) almost all of whose sample
functions are continuous.

THEOREM 11.1 Let $\{x_t, t \in T\}$ be a stochastic process, and let $T_1$ be a
set of limit points of $T$. Suppose that, if $t \in T_1$, at least one of the stochastic
limits

$$\operatorname*{p\,lim}_{s\uparrow t} x_s = x_{t-}, \qquad \operatorname*{p\,lim}_{s\downarrow t} x_s = x_{t+}$$

exists. There is then an at most enumerable subset $T_0$ of $T_1$ such that, if
$t \in T_1 - T_0$, then both stochastic limits $x_{t-}$ and $x_{t+}$ are defined, and

$$x_{t-} = x_{t+} \quad (= x_t \text{ if } t \in T)$$

with probability 1.

In the most important applications of this theorem $T$ is an interval,
and the stochastic limits $x_{t-}$, $x_{t+}$ are known to exist at every point of $T$.
The theorem then asserts that, for each $t$ not in some exceptional set
which is at most enumerable,

$$x_{t-} = x_t = x_{t+}$$

with probability 1.


To prove the theorem define the distance between any two random
variables $x$, $y$ as the greatest lower bound of values of $\varepsilon$ for which

$$P\{|x(\omega) - y(\omega)| \ge \varepsilon\} \le \varepsilon.$$

With this definition of distance $d(x, y)$, $\lim_{n\to\infty} d(x, x_n) = 0$ if and only if
$\operatorname*{p\,lim}_{n\to\infty} x_n = x$. Then, for each $t \in T$, the random variable $x_t$ is a point of
a complete metric space, so that the random variables of the $x_t$ process
define a $t$ function $f$, $t \in T$, with values in this metric space. By hypothesis
$f(t-)$ or $f(t+)$ exists for every $t \in T_1$. To prove the theorem it must be
proved that the $T_1$ set whose points are not limit points of $T$ from both
sides, or are limit points from both sides but are points where the oscillation
of $f$ is positive, is an at most enumerable set. It is sufficient to prove
that the $T_1$ set $T_1(n)$, the points of which are not limit points of $T$ from
both sides, or are limit points from both sides but are points where the
oscillation of $f(t)$ is $\ge 1/n$, is at most enumerable. If $t \in T_1(n)$ and if $t$
is not a limit point of $T$ from both sides, it cannot be a limit point of $T_1$
from both sides. Hence there is an interval with $t$ as one endpoint,
containing $t$ but no other point of $T_1$. If $t \in T_1(n)$ and if $f(t-)$ $[f(t+)]$
exists, $t$ is the right [left] endpoint of an interval containing $t$ but no other
point of $T_1(n)$. We thus obtain a set of intervals which we can suppose
chosen to be disjunct, each containing a single point of $T_1(n)$, and all
points of this set are contained in the intervals. The set $T_1(n)$ is therefore
at most enumerable, because a set of disjunct intervals is at most
enumerable.

In the following, a point $t_0 \in T$ will be called a fixed point of discontinuity
of a stochastic process $\{x_t, t \in T\}$ if it is false that whenever $s_n \to t_0$,

$$\lim_{n\to\infty} x_{s_n} = x_{t_0}$$

with probability 1. If the process is separable, it follows that $t_0$ is a fixed
point of discontinuity if and only if it is false that

$$\lim_{s\to t_0} x_s = x_{t_0}$$

with probability 1, that is, if and only if there is positive probability for a
sample function discontinuity at $t_0$. Any point of discontinuity of a
sample function of a separable process, not a fixed point of discontinuity,
will be called a moving point of discontinuity. Even if there are no fixed
points of discontinuity, it does not follow that the sample functions of a
separable stochastic process are almost all (that is, with probability 1)
continuous functions, because there may be moving points of discontinuity,
as in the case of the Poisson process (see Example 2 above).


THEOREM 11.2 Let $\{x_t, t \in T\}$ be a semi-martingale, and let $a$, $b$ be
respectively the minimum and maximum values of the closure of $T$. Define
$T'$ as the set of limit points of $T$, except that $b$ is to be excluded from $T'$
unless $b \in T$, and $a$ is to be excluded from $T'$ unless

$$\underset{s\in T}{\mathrm{G.L.B.}}\; E\{x_s\} > -\infty.$$

(i) To each point $t \in T'$ which is a limit point of $T$ from the left [right]
there corresponds a random variable $x_{t-}$ $[x_{t+}]$ such that, if $s_n \to t$ with
$s_n < t$ $[s_n > t]$ and $s_n \in T$, then

$$\lim_{n\to\infty} x_{s_n} = x_{t-} \quad [x_{t+}]$$

with probability 1. If the $x_t$ process is separable, these sequential limits
can be replaced by ordinary limits

$$\lim_{s\uparrow t} x_s = x_{t-}, \qquad \lim_{s\downarrow t} x_s = x_{t+}.$$

(ii) Except possibly for the points of an at most enumerable $t$ set, for
each $t \in T'$ the following equation holds with probability 1 between as many
of the three members as are defined:

$$x_{t-} = x_t = x_{t+}.$$

In particular, at most enumerably many parameter points are fixed points
of discontinuity.

Let $t \in T'$ be a limit point of $T$ from the left. The existence of $x_{t-}$ as
described follows from an application of the general parameter set version
of Theorem 4.1s to the semi-martingale $\{x_s, s \in T, s < t\}$ if we can show
that, for some $t_0 < t$,

$$\underset{t_0 \le s < t}{\mathrm{L.U.B.}}\; E\{|x_s|\} < \infty.$$

To show this let $t_1$ be a point of $T$ with $t_1 \ge t$. There is such a point
since $t \in T'$. Choose $t_0$ any point of $T$ with $t_0 < t$. Then, by Theorem
3.1 (ii),

$$E\{|x_s|\} \le -E\{x_{t_0}\} + 2E\{|x_{t_1}|\}, \qquad t_0 \le s \le t_1.$$

The general parameter set version of Theorem 4.1s is thus applicable, and
asserts the existence of $x_{t-}$. The existence of $x_{t+}$ is proved by applying
Theorem 4.2s. Finally (ii) follows from Theorem 11.1. We observe that
the point $b$ could have been allowed in $T'$ also under the condition that
$E\{|x_t|\}$ is bounded near $b$.

Let $\{x_t, t \in T\}$ be a semi-martingale. In §1, §2, and §3 we discussed
various conditions under which the $x_t$'s are uniformly integrable. The
following theorem gives necessary and sufficient conditions.


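THEOREM 11.3 In Theorem 11.2,

$$E\{x_s\} \le E\{x_t\}, \qquad s < t,$$

$$\lim_{s\uparrow t} E\{x_s\} \le E\{x_{t-}\} \le E\{x_t\} \le E\{x_{t+}\} = \lim_{s\downarrow t} E\{x_s\}.$$

If $t_1 \in T$, the $x_t$'s with $t \le t_1$ are uniformly integrable if and only if

$$\underset{t\in T}{\mathrm{G.L.B.}}\; E\{x_t\} > -\infty, \qquad \lim_{s\uparrow t} E\{x_s\} = E\{x_{t-}\}, \quad t \le t_1,$$

whenever $t$ is a limit point of $T$ from the left.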

The inequalities relating the expectations may be vacuously true at some
values of $t$, since the random variables in question are not necessarily
defined for all $t$, but the continued inequality of the theorem holds for
all $t$ in the sense that, whenever two of the random variables involved do
exist, the stated inequality between their expectations is true. The
inequalities have been proved in the discrete parameter case, and no
further discussion of them is needed here. The uniform integrability
statement is deduced from discrete parameter special cases as follows.
The indicated $x_t$'s are uniformly integrable if and only if the $x_{s_n}$'s of every
sequence $\{x_{s_n}\}$ are uniformly integrable, and we may even restrict our
attention to monotone sequences $\{s_n\}$. Now, if $\{s_n\}$ is monotone decreasing,
we have proved [Theorem 3.1 (iv)] that the $x_{s_n}$'s are uniformly
integrable if and only if

$$\lim_{n\to\infty} E\{x_{s_n}\} > -\infty;$$

if $\{s_n\}$ is monotone increasing, with $s_n \to t$, we have proved [Theorem
4.1s (ii)] that the $x_{s_n}$'s are uniformly integrable if and only if

$$\lim_{n\to\infty} E\{x_{s_n}\} = E\{x_{t-}\}.$$

These two are the stated conditions of the present theorem.

According to Theorem 11.3, the $x_t$'s of a semi-martingale, with $t \le t_1$
and $t$ in the parameter set, will be uniformly integrable, for example, if
$E\{x_t\}$ is bounded in $t$ and skips no values when $t$ increases, that is, if
$E\{x_t\}$ runs through the values of an interval (which may degenerate into a
single point). This is true, in particular, if the $x_t$ process is a martingale,
when $E\{x_t\}$ is identically constant. However, nothing new is obtained in
this case, since, according to Theorem 3.1 (iii) (applied to the absolute
values of the martingale variables), if the $x_t$ process is a martingale, the
$x_t$'s with $t \le t_1$ are uniformly integrable.

Theorems 4.4 and 4.4s showed how to extend the parameter sets of 
certain simple types of martingales and semi-martingales. The following 
theorem is the general parameter set version of these theorems. 


THEOREM 11.4 Let $\{x_t, \mathcal{F}_t, t \in T\}$ be a semi-martingale [martingale],
and let $I$ be the closed interval with endpoints the minimum and maximum
values of the closure of $T$, except that the right-hand endpoint is to be
excluded from $I$ unless this endpoint is in $T$, and the left-hand endpoint is
to be excluded in the semi-martingale case unless $\mathrm{G.L.B.}_{t\in T}\, E\{x_t\} > -\infty$.
Then it is possible to define $x_t$ and $\mathcal{F}_t$ for $t \in I - T$ in such a way that the
process $\{x_t, \mathcal{F}_t, t \in I\}$ is a semi-martingale [martingale].

Suppose that the x, process is a semi-martingale [martingale]. If 
te J—T and if t is a limit point of T from the right, define 


Li = Tijs F.=OF, 


at * 


The process with the thus enlarged parameter set is easily seen to be a 
semi-martingale [martingale] using the general parameter set version of 
Theorem 4.1s [Theorem 4.1]. (See also Theorem 4.4.) Let [c, d] be a 
closed interval whose endpoints but no other points lie in the closure of 
T. Then x, and F, are already defined, and we define 


pee F,=Fy, (C= 7-<=4): 
also defining 


=e EE” 
w ai A SF 


if x and F, are not already defined. The resulting process {x, F, t e1} 
is then a semi-martingale [martingale]. Finally we remark that the 
extension we have given is not the only one that can be given in some cases. 
Theorem 11.4 shows that in considering martingales and semi-martin- 
gales we can assume, if desirable, that the parameter set is an interval. 
Example 3 In §8 an example was given of a martingale {x,, n > 1} 
satisfying the conditions 


% 20, Ef,}=1, nl; 
lim x, = 0 
n> oO 


with probability 1. Define 
=e fe 


The process {#,, 1 < n< o} is then a semi-martingale whose random 
variables are not uniformly integrable, The set J of Theorem 11.4 
becomes the interval [1, œ], and #,_=0 with probability 1. The 
definition of @, for ¢ not an integer, as given in the proof of Theorem 11.4, 
becomes 


%=8,, n-l<tsn n=2,3,0+% 




In this example $\hat{x}_\infty$ could have been taken as any non-negative random
variable with a finite expectation, but no choice would make the process a
martingale.
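
The failure of uniform integrability in Example 3 can be checked by exact arithmetic on one concrete martingale of the required kind. The sketch below assumes a particular choice that is not taken from §8: $x_n = 2^n$ with probability $2^{-n}$ and $x_n = 0$ otherwise, so that $x_n \ge 0$, $E\{x_n\} = 1$ and $x_n \to 0$ with probability 1. The function names are illustrative only. Since $|\hat{x}_n| = |x_n|$, the tail masses computed here are also those of the $\hat{x}_n$'s.

```python
from fractions import Fraction


def law_of_x(n):
    """Distribution of x_n as {value: probability} (assumed example, not the one of §8)."""
    p = Fraction(1, 2 ** n)
    return {2 ** n: p, 0: 1 - p}


def expectation(law):
    """E{x_n} computed exactly from the distribution."""
    return sum(value * prob for value, prob in law.items())


def tail_mass(law, lam):
    """E{|x_n|; |x_n| > lam}, the quantity that must be uniformly small."""
    return sum(abs(value) * prob for value, prob in law.items() if abs(value) > lam)


def sup_tail_mass(lam, n_max):
    """Largest tail mass over n = 1, ..., n_max."""
    return max(tail_mass(law_of_x(n), lam) for n in range(1, n_max + 1))
```

For every fixed level the tail mass equals 1 as soon as $2^n$ exceeds that level, so the supremum over n does not go to 0 when the level increases, although each $x_n$ by itself is integrable.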

In the following we use the concept of jump of a function at a point, 
as defined in V §1, a discontinuity at which the function has one-sided 
limits, and at which the value of the function lies between these limits. 
If the function is complex-valued we shall say it has a jump if its real and 
imaginary parts do, or if one does and the other is continuous. 

THEOREM 11.5 Except possibly for a set of sample functions of
probability 0, the sample functions of a separable semi-martingale $\{x_t, t \in T\}$
have the following properties:

(i) They are bounded on every t set of the form $[a_1, b_1]T$ with $a_1, b_1 \in T$,
or simply $b_1 \in T$ if the process is a martingale.

(ii) They have finite left- [right-] hand limits at every $t \in T$ which is a
limit point of T from the left [right].

(iii) Their discontinuities are jumps, except perhaps at the fixed points of
discontinuity.

Since a function which has finite left- and right-hand limits at all values 
of the argument where such limits are definable has at most enumerably 
many points of discontinuity, almost all sample functions of a separable 
martingale are this regular. The theorem is true for complex martingales 
because it is true for the corresponding real and imaginary parts. 

The generalized parameter set version of Theorem 3.2 implies (i)
trivially. In proving (ii) and (iii) we shall suppose that T is infinite.
Otherwise there would be nothing to prove. According to (i), there is
an $\omega$ set $\Lambda_1$ of probability 0, such that every sample function of the $x_t$
process corresponding to an $\omega \notin \Lambda_1$ is bounded in every interval with
endpoints in T. By definition of separability of a stochastic process,
there is a sequence $\{t_j\} \subset T$, dense in T, and an $\omega$ set $\Lambda_2$ of probability 0,
such that every sample function corresponding to a point $\omega \notin \Lambda_2$ has the
same lower and upper bounds on open intervals I as on the $t_j$'s in the
intervals, that is,

$$\operatorname*{G.L.B.}_{t \in IT} x_t(\omega) = \operatorname*{G.L.B.}_{t_j \in IT} x_{t_j}(\omega), \qquad \operatorname*{L.U.B.}_{t \in IT} x_t(\omega) = \operatorname*{L.U.B.}_{t_j \in IT} x_{t_j}(\omega), \qquad \omega \notin \Lambda_2.$$

Now let $a_1, b_1$ be points of T, with $a_1 < b_1$, chosen so that $(a_1, b_1)T$ is not
empty. Let $t_1^{(n)}, t_2^{(n)}, \cdots$ be $a_1$, $b_1$, and those of the first n $t_j$'s in $(a_1, b_1)$,
ordering the $t_j^{(n)}$'s so that $t_1^{(n)} < t_2^{(n)} < \cdots$. Let $r_1, r_2$ be real numbers
with $r_1 < r_2$ and let $\beta_n(\omega)$ be the number of upcrossings of $[r_1, r_2]$ by
$x_{t_1^{(n)}}(\omega), x_{t_2^{(n)}}(\omega), \cdots$. According to Theorem 3.3,

$$E\{\beta_n\} \le \frac{E\{|x_{b_1}|\} + |r_1|}{r_2 - r_1}.$$



Then, if $M_{nk} = \{\beta_n(\omega) \ge k\}$,

$$P\{M_{nk}\} \le \frac{E\{|x_{b_1}|\} + |r_1|}{k(r_2 - r_1)}, \qquad k \ge 1,$$

so that

$$(11.1) \qquad P\Big\{\bigcup_n M_{nk}\Big\} \le \frac{E\{|x_{b_1}|\} + |r_1|}{k(r_2 - r_1)}.$$

Now suppose that a sample function $g(\cdot)$ corresponding to a point $\omega \notin \Lambda_2$
has an oscillatory discontinuity at some point $s \in [a_1, b_1]T$, with either

$$\limsup_{t \downarrow s} g(t) > r_2 > r_1 > \liminf_{t \downarrow s} g(t), \qquad t \in [a_1, b_1]T,$$

or the same inequality with $t \uparrow s$. Then this same inequality is true if
t approaches s remaining in $\{t_j\}$, so that the number of upcrossings of
$[r_1, r_2]$ by $g(t_1^{(n)}), g(t_2^{(n)}), \cdots$ becomes infinite when $n \to \infty$. Thus, if
M is the $\omega$ set corresponding to the sample functions g of this type,

$$(11.2) \qquad M \subset \bigcup_n M_{nk}, \qquad k = 1, 2, 3, \cdots.$$

According to (11.1) the intersection in k of the $\omega$ sets on the right in
(11.2) has probability 0. Let $\Lambda(r_1, r_2, a_1, b_1)$ be this intersection, and
define

$$\Lambda_3 = \bigcup_{r_1, r_2, a_1, b_1} \Lambda(r_1, r_2, a_1, b_1),$$

where $r_1, r_2$ vary over all rational numbers with $r_1 < r_2$ and $a_1$ [$b_1$] is
either the minimum [maximum] value of T if there is one, or in the
contrary case $a_1$ [$b_1$] varies over some sequence of values in T converging to
$\operatorname*{G.L.B.}_{t \in T} t$ [$\operatorname*{L.U.B.}_{t \in T} t$]. Then

$$P\{\Lambda_3\} = 0,$$

and any sample function of the process corresponding to a point not
in $\Lambda_1 \cup \Lambda_2 \cup \Lambda_3$ has finite left- and right-hand limits at each point of
discontinuity. Moreover, because of the defining properties of the $t_j$'s, any
discontinuity of such a sample function at a point other than a $t_j$ must
be a jump. Now, if a $t_j$ is not a fixed point of discontinuity, the sample
functions are continuous at $t_j$ with probability 1. Hence, if a further $\omega$
set $\Lambda_4$ of probability 0 is excluded, we can say that any discontinuity of
a sample function not corresponding to an $\omega$ in one of the $\Lambda_i$'s is a jump,
unless the point of discontinuity is simultaneously a $t_j$ and a fixed point
of discontinuity. This completes the proof of the theorem.
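
The quantity $\beta_n(\omega)$ used in this proof is a purely combinatorial function of the finitely many sampled values. A minimal sketch of how it could be computed is given below; the function name and the convention adopted (an upcrossing is completed when the sequence reaches a value $\ge r_2$ after having been at a value $\le r_1$) are assumptions made for illustration, not a transcription of the definition used in Theorem 3.3.

```python
def count_upcrossings(values, r1, r2):
    """Count upcrossings of [r1, r2] by a finite sequence of numbers.

    An upcrossing is counted each time the sequence reaches a value >= r2
    after having taken a value <= r1 (convention assumed for this sketch).
    """
    if not r1 < r2:
        raise ValueError("the interval must satisfy r1 < r2")
    count = 0
    was_below = False  # True once a value <= r1 has been seen since the last upcrossing
    for value in values:
        if value <= r1:
            was_below = True
        elif value >= r2 and was_below:
            count += 1
            was_below = False
    return count
```

A sample function with an oscillatory discontinuity of the kind considered above makes this count grow without bound as more of the $t_j$'s near s are included, which is exactly what (11.1) shows to have probability 0.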

Theorem 11.5, as well as some of the earlier results of this section,
have related to separable semi-martingales, not semi-martingales in
general. However, if the stochastic process $\{x_t, t \in T\}$ is a semi-martingale,
it follows from II §2 (see II, Theorem 2.4) that there is a semi-martingale
$\{\tilde{x}_t, t \in T\}$ such that, for each t,

$$P\{\tilde{x}_t(\omega) = x_t(\omega)\} = 1$$

and that the $\tilde{x}_t$ process is separable. The separable process results are
then applicable to the $\tilde{x}_t$ process. The results can also be expressed
without the use of the concept of separability. To show how this can be
done we rephrase Theorem 11.5 (ii) in this way:

[THEOREM 11.5 (ii)] Let $\{x_t, t \in T\}$ be a semi-martingale, and let $T_0$ be
a finite or enumerable subset of T. Then the sample functions of the $x_t$
process almost all have the following property: they coincide on $T_0$ with
functions defined on T which have finite left- [right-] hand limits at every
$t \in T$ which is a limit point of T from the left [right].

The truth of the original version of Theorem 11.5 (ii) for the $\tilde{x}_t$ process
defined above implies the truth of this alternative version, because

$$P\{\tilde{x}_t(\omega) = x_t(\omega), \ t \in T_0\} = 1.$$

Moreover, the alternative version implies the original one if we choose
$T_0$ as the finite or enumerable set involved in the definition of separability.

We now continue our discussion of the general parameter set versions
of the discrete parameter theorems of previous sections.
(§5, Zero-one law) Let $\{x_t, t \in T\}$ be a process with independent
increments, and let z be a random variable which for each
$b_1 < b = \operatorname*{L.U.B.}_{t \in T} t$ is a random variable on the sample space of
differences $x_{s_2} - x_{s_1}$ with

$$b_1 < s_1 < s_2 < b.$$

Then $z(\omega) = $ const. with probability 1. The proof can be given directly
or derived from martingale theory as in the discrete parameter case.

In the following we shall be interested in separable stochastic processes
$\{x_t, t \in T\}$, and shall be particularly interested in the differences $x_t - x_a$.
We shall use without further reference the obvious fact that, if the $x_t$
process is separable, and if z is any random variable, then the $x_t - z$
process is also separable.

(THEOREM 5.1) Let $\{x_t, t \in T\}$ be a separable stochastic process with
independent increments, and suppose that T has initial point a and last point
b. Then, if $E\{x_t - x_a\} = 0$,

$$E\Big\{\operatorname*{L.U.B.}_{t \in T} |x_t - x_a|^{\alpha}\Big\} \le 8 E\{|x_b - x_a|^{\alpha}\}, \qquad \alpha \ge 1.$$

(This result could be stated slightly more generally to follow Theorem
5.1 more closely, but the generalization is an immediate consequence of
Theorem 5.1 and has no independent interest.) This theorem can be
proved by translating the proof of Theorem 5.1 into the continuous
parameter case, or it can be reduced to Theorem 5.1 by remarking that
according to Theorem 5.1 the desired inequality is true if the left side is
replaced by

$$E\Big\{\operatorname*{L.U.B.}_{j} |x_{t_j} - x_a|^{\alpha}\Big\}$$

if $\{t_j\}$ is a finite subset of T, and therefore if $\{t_j\}$ is an enumerably infinite
subset of T. Since the $x_t$ process is separable, the sequence $\{t_j\}$ can be
chosen to make

$$\operatorname*{L.U.B.}_{j} |x_{t_j} - x_a|^{\alpha} = \operatorname*{L.U.B.}_{t \in T} |x_t - x_a|^{\alpha}$$

with probability 1, and the theorem is now completely proved.

The extension of Theorem 5.2 to more general parameter sets is obvious, 
and we omit an explicit formulation. 

(§6, Strong law of large numbers for mutually independent random
variables with a common distribution function) Let $\{x_t, 0 \le t < \infty\}$ be
a separable stochastic process with stationary independent increments, and
suppose that $E\{x_t - x_0\} = 0$. Then

$$(11.3) \qquad \lim_{t \to \infty} \frac{x_t}{t} = 0$$

with probability 1.

To prove this theorem we observe first that, according to the discrete
parameter version of §6,

$$(11.4) \qquad \lim_{n \to \infty} \frac{x_n - x_0}{n} = \lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} (x_j - x_{j-1}) = E\{x_1 - x_0\} = 0,$$

and

$$\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} \operatorname*{L.U.B.}_{j-1 \le t \le j} |x_t - x_{j-1}| = E\Big\{\operatorname*{L.U.B.}_{0 \le t \le 1} |x_t - x_0|\Big\}$$

with probability 1. (According to the general parameter version of
Theorem 5.1 the last expectation is finite.) Then

$$\lim_{n \to \infty} \frac{1}{n-1} \sum_{j=1}^{n-1} \operatorname*{L.U.B.}_{j-1 \le t \le j} |x_t - x_{j-1}| = E\Big\{\operatorname*{L.U.B.}_{0 \le t \le 1} |x_t - x_0|\Big\}$$

and we can obviously replace $1/(n-1)$ by $1/n$ here. Subtracting the
resulting equation from the preceding limit equation, we find that

$$\lim_{n \to \infty} \frac{1}{n} \operatorname*{L.U.B.}_{n-1 \le t \le n} |x_t - x_{n-1}| = 0$$

with probability 1, and this result combined with (11.4) yields

$$\lim_{t \to \infty} \frac{x_t - x_0}{t} = \lim_{t \to \infty} \frac{x_t}{t} = 0$$

with probability 1. This theorem could also have been proved by
showing that, if $t \ge 1$,

$$E\{x_1 - x_0 \mid x_s - x_0, \ s \ge t\} = E\{x_1 - x_0 \mid x_t - x_0\} = \frac{x_t - x_0}{t}$$

and then using the continuous parameter version of Theorem 4.2 or
Theorem 4.3 to show that the terms of the equality converge to a limit
when $t \to \infty$. The limit is a constant by the zero-one law, and the
constant is 0 because the limit is a random variable in the same martingale
as $x_1 - x_0$ and hence has the same expectation.
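
The statement (11.3) can be illustrated numerically on one process with stationary independent increments and mean 0. The sketch below assumes a compensated Poisson process, $x_t = N_t - \lambda t$ with $N_t$ a Poisson process of rate $\lambda$; this choice, the function name, and the seed are illustrative assumptions and are not part of the text. A single simulated path is of course only an illustration, not a proof.

```python
import random


def compensated_poisson_at(times, rate=1.0, seed=0):
    """Values x_t = N_t - rate * t of one simulated path at the requested times.

    N_t counts the arrivals of a Poisson process of the given rate, built from
    independent exponential inter-arrival times (assumed example process).
    """
    rng = random.Random(seed)
    values = []
    arrivals = 0
    next_jump = rng.expovariate(rate)
    for t in sorted(times):
        # Register every arrival that occurs up to and including time t.
        while next_jump <= t:
            arrivals += 1
            next_jump += rng.expovariate(rate)
        values.append(arrivals - rate * t)
    return values
```

Evaluating the path at a large time t and dividing by t gives a ratio of the order of $t^{-1/2}$, in agreement with the limit 0 asserted in (11.3).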

The following preliminary discussion will lead to the general parameter
set versions of Theorems 2.1 and 2.2. The treatment will be more
unified here than in §2 because Theorem 2.1 was given special treatment
in §2 in view of the references to it in §4.

The point of the general parameter set versions of Theorems 2.1 and
2.2 is that they presuppose a semi-martingale or martingale $\{x_t, t \in T(x)\}$,
and assert that a new process $\{\tilde{x}_\alpha, \alpha \in T(\tau)\}$, obtained from the $x_t$ process
by a certain type of sampling, is also a semi-martingale or martingale.
In Theorems 2.1 and 2.2 both $T(x)$ and $T(\tau)$ are sets of integers. In the
more general case to be discussed now, each of these two parameter sets
will be an arbitrary linear set. We make the following hypotheses, which
reduce in the discrete parameter case to those made in §2.

OS$_1$ $\{x_t, t \in T(x)\}$ is a stochastic process.

OS$_2$ For each $t \in T(x)$, $F_t$ is a Borel field of measurable $\omega$ sets, with
the following properties:

(a) $F_s \subset F_t$, $s < t$;

(b) $x_t$ is either measurable with respect to $F_t$ or is equal for almost all
$\omega$ to a function that is.

OS$_3$ Almost all sample functions of the $x_t$ process have finite limits from
the right,

$$\lim_{s \downarrow t} x_s(\omega) = x_{t+}(\omega),$$

for all $t \in T(x)$.

OS$_4$ $\{\tau_\alpha, \alpha \in T(\tau)\}$ is a stochastic process defined on the same $\omega$ space
as the $x_t$ process, and has the following properties:

(a) for each $\alpha \in T(\tau)$, the values taken on by $\tau_\alpha$ lie in the set $T(x)$;

(b) $\tau_\alpha(\omega)$ is monotone non-decreasing in $\alpha$ for fixed $\omega$;

(c) if $\alpha \in T(\tau)$, either

$$\{\tau_\alpha(\omega) \le s\} \in F_s, \qquad s \in T(x),$$

or the $\omega$ set on the left differs by at most a set of probability 0 from an
$F_s$ set.


If each $F_t$ contains all $\omega$ sets of probability 0, the weaker alternatives
in OS$_2$ (b) and OS$_4$ (c) need not be given. Moreover, it is no restriction
to assume that each $F_t$ contains the $\omega$ sets of probability 0, since, if this
is not true initially, $F_t$ can be replaced by the Borel field generated by the
sets of $F_t$ and those of probability 0.

We now define the process $\{\tilde{x}_\alpha, \alpha \in T(\tau)\}$ obtained from the $x_t$ process
by optional sampling. For each $\alpha$ let $S_\alpha$ be the (at most enumerable) set
of values which $\tau_\alpha$ takes on with positive probability. Define $\tilde{x}_\alpha(\omega)$ by

$$(11.5) \qquad \tilde{x}_\alpha(\omega) = \begin{cases} x_{\tau_\alpha(\omega)}(\omega), & \text{if } \tau_\alpha(\omega) \in S_\alpha, \\ x_{\tau_\alpha(\omega)+}(\omega), & \text{if } \tau_\alpha(\omega) \notin S_\alpha. \end{cases}$$

This definition is meaningful on the $\omega$ set of probability 1 corresponding
to the $x_t$ process sample functions which have limits from the right at all
values of the argument, except that $\tilde{x}_\alpha$ is not defined when $\tau_\alpha$ takes on
with probability 0 one of the at most enumerably many values in $T(x)$
which are not limit points of $T(x)$ from the right. In other words, we
have defined $\tilde{x}_\alpha$ for each $\alpha \in T(\tau)$, with probability 1. We shall prove
that with this definition $\tilde{x}_\alpha$ is a random variable, that is, a measurable $\omega$
function. We expect that just as in the discrete parameter case martingale
and semi-martingale properties of the $x_t$ process go over into similar
properties of the $\tilde{x}_\alpha$ process. Note that OS$_3$ is satisfied whenever the $x_t$
process is a separable semi-martingale or martingale, by Theorem 11.5.

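The most important type of optional sampling is optional stopping,
defined as follows. Let OS$_1$, OS$_2$, OS$_3$ be satisfied, and let $\tau$ be a random
variable, $-\infty \le \tau(\omega) \le \infty$, satisfying

OS$_4'$ (a) The values $\ne \operatorname*{L.U.B.}_{t \in T(x)} t$ taken on by $\tau$ lie in the set $T(x)$.

(b) Either

$$\{\tau(\omega) \le s\} \in F_s, \qquad s \in T(x),$$

or the set on the left differs by at most a set of probability 0 from an $F_s$ set.

Then, if we define $T(\tau) = T(x)$, and

$$\tau_t(\omega) = \operatorname{Min}\,[t, \tau(\omega)], \qquad t \in T(x),$$

the $\tau_t$'s satisfy OS$_4$ and the optional sampling determined in this way will
be called optional stopping. According to the definition,

$$\tilde{x}_t(\omega) = \begin{cases} x_{t+}(\omega), & \text{if } t \le \tau(\omega) \text{ and } P\{\tau(\omega) \ge t\} = 0, \\ x_t(\omega), & \text{if } t \le \tau(\omega) \text{ and } P\{\tau(\omega) \ge t\} > 0, \\ x_{\tau(\omega)+}(\omega), & \text{if } t > \tau(\omega), \end{cases}$$

except that, if $\tau$ takes on any value s with positive probability,

$$\tilde{x}_t(\omega) = x_s(\omega) \qquad \text{if } t > s \text{ and } \tau(\omega) = s.$$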
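
In the discrete parameter case optional stopping reduces to $\tilde{x}_n = x_{\operatorname{Min}[n, \tau]}$, and the preservation of the martingale expectation can be verified by exact enumeration. The sketch below assumes a simple symmetric random walk started at 0 and stopped at the first time it reaches one of two barriers; the walk, the barrier values 2 and -3, and the function names are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product


def stopped_value(steps, t, upper, lower):
    """Value of the walk at time min(t, tau), where tau is the first time
    the walk started at 0 reaches `upper` or `lower`."""
    position = 0
    for step in steps[:t]:
        if position >= upper or position <= lower:
            break  # the walk was stopped at an earlier time
        position += step
    return position


def expected_stopped_value(n_steps, t, upper=2, lower=-3):
    """Exact expectation of the stopped walk at time t, by enumerating all
    equally likely paths of +1/-1 steps of length n_steps."""
    weight = Fraction(1, 2 ** n_steps)
    total = Fraction(0)
    for steps in product((-1, 1), repeat=n_steps):
        total += weight * stopped_value(steps, t, upper, lower)
    return total
```

For this walk the expectation of the stopped process is 0 at every time, which is the discrete parameter instance of the equality asserted for martingales under optional stopping.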


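The following approximation procedure will be used. Suppose that
OS$_1$-OS$_4$ are satisfied. For each positive integer q choose finitely many
points

$$a_1^{(q)} < a_2^{(q)} < \cdots, \qquad a_j^{(q)} \in T(x),$$

in such a way that the first q points of $S_\alpha$ enumerated in some order are
$a_j^{(q)}$'s and that every point of $T(x)$ in the interval $[-q, q]$ is within distance
$1/q$ of some $a_j^{(q)}$, and that the infinite points of $T(x)$, if any, are $a_j^{(q)}$'s.
Define

$$\tilde{x}_\alpha^{(q)}(\omega) = \begin{cases} x_{a_1^{(q)}}(\omega), & \text{if } \tau_\alpha(\omega) \le a_1^{(q)}, \\ x_{a_j^{(q)}}(\omega), & \text{if } a_{j-1}^{(q)} < \tau_\alpha(\omega) \le a_j^{(q)}, \ j > 1, \\ 0, & \text{if } \operatorname*{Max}_j a_j^{(q)} < \tau_\alpha(\omega). \end{cases}$$

Then $\tilde{x}_\alpha^{(q)}$ is a measurable $\omega$ function, that is, a random variable, and

$$\lim_{q \to \infty} \tilde{x}_\alpha^{(q)} = \tilde{x}_\alpha$$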
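with probability 1. Hence $\tilde{x}_\alpha$ is a random variable. We shall find it
helpful to specify more closely a Borel field $\tilde{F}_\alpha$ with respect to which $\tilde{x}_\alpha$
is measurable. The important point will be that this Borel field will be
defined in terms of the $F_t$'s and $\tau_\alpha$'s, and will thus not depend on the $x_t$'s.
Fix $\alpha \in T(\tau)$. For each positive integer q let $\tilde{F}_\alpha^{(q)}$ be the Borel field
generated by the $\omega$ sets of probability 0 and by those of the form

$$\Lambda\{a < \tau_\alpha(\omega) \le b\}, \qquad \Lambda \in F_c,$$

where a, b, c are not necessarily finite, no one of the first q points of $S_\alpha$
is an interior point of an interval $(a, b]$, and $\arctan b - \arctan a < 1/q$.
It is no restriction to take $b \in T(x)$ and $c = b$ here. If $b = -\infty \in T(x)$,
we understand by the above

$$\Lambda\{\tau_\alpha(\omega) = -\infty\}, \qquad \Lambda \in F_{-\infty}.$$

Define

$$\tilde{F}_\alpha = \bigcap_q \tilde{F}_\alpha^{(q)}.$$

Then the approximant $\tilde{x}_\alpha^{(q)}$ defined above is measurable with respect to
$\tilde{F}_\alpha^{(r)}$ for large q, and therefore $\tilde{x}_\alpha$ is measurable with respect to $\tilde{F}_\alpha^{(r)}$ for
every r, so that $\tilde{x}_\alpha$ is measurable with respect to $\tilde{F}_\alpha$. Moreover, if $\alpha < \beta$,
it follows that $\tilde{F}_\alpha \subset \tilde{F}_\beta$. To prove this we prove that $\tilde{F}_\alpha \subset \tilde{F}_\beta^{(q)}$ for
all q, by proving that each set of the class generating $\tilde{F}_\alpha^{(q)}$ is an $\tilde{F}_\beta^{(q)}$ set.
It is sufficient to consider $\tilde{F}_\alpha^{(q)}$ sets of the form

$$\Lambda_1 = \Lambda\{a < \tau_\alpha(\omega) \le b\}, \qquad \Lambda \in F_b, \quad b \in T(x),$$

with b finite. Trivial changes in the argument take care of non-finite b.
Note that $\Lambda_1 \in F_b$. Since $\tau_\alpha \le \tau_\beta$,

$$\Lambda_1 = \Lambda_1\{a < \tau_\beta(\omega) \le b\} \cup \bigcup_{j=1}^{n} \Lambda_1\{b + (j-1)\delta < \tau_\beta(\omega) \le b + j\delta\} \cup \Lambda_1\{b + n\delta < \tau_\beta(\omega) \le \infty\},$$

and this union exhibits $\Lambda_1$ as an $\tilde{F}_\beta^{(q)}$ set, if $\delta > 0$, if $\arctan \delta < 1/q$, and
if $\pi/2 - \arctan (b + n\delta) < 1/q$.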

The point of the following discussion is to prove that under suitable
regularity conditions a semi-martingale [martingale] $\{x_t, F_t, t \in T(x)\}$ goes
into a semi-martingale [martingale] $\{\tilde{x}_\alpha, \tilde{F}_\alpha, \alpha \in T(\tau)\}$ under optional
sampling. Note that, if $\Phi$ is a Baire function of two variables, and if $y_t$ is defined by

$$y_t(\omega) = \Phi[t, x_t(\omega)],$$

then, if the $x_t$ process satisfies OS$_1$ and OS$_2$, the $y_t$ process satisfies the
same conditions with the same family of Borel fields. Moreover, if $y_t$
satisfies OS$_3$ and if under optional sampling determined by $\tau_\alpha$'s satisfying
OS$_4$ the $y_t$ process goes into a $\tilde{y}_\alpha$ process, then $\{\tilde{y}_\alpha, \tilde{F}_\alpha, \alpha \in T(\tau)\}$, if
specified regularity hypotheses are satisfied, will also be a semi-martingale
[martingale]. Here $\tilde{F}_\alpha$, as defined above, does not depend on the choice
of $\Phi$. In particular, if $\Phi$ is continuous,

$$\tilde{y}_\alpha(\omega) = \Phi[\tau_\alpha(\omega), \tilde{x}_\alpha(\omega)],$$

and the $y_t$ process will satisfy OS$_3$ if the $x_t$ process satisfies OS$_3$. This is
the most important special case.

We now discuss the general parameter set version of Theorem 2.2. 
The results will be divided into a succession of lemmas and theorems for 
greater clarity. These results will contain Theorem 2.2 as a special case, 
and will even sharpen this case in some directions. The special case of 
optional stopping, corresponding to Theorem 2.1, will not be stated 
separately. We shall assume throughout the proofs that the processes are 
real, omitting the trivial remarks necessary to extend the discussion to the 
complex case. Throughout the discussion we assume that $\{x_t, F_t, t \in T(x)\}$
is a semi-martingale or martingale satisfying OS$_1$, OS$_2$, OS$_3$, and that it is
transformed into the process $\{\tilde{x}_\alpha, \tilde{F}_\alpha, \alpha \in T(\tau)\}$ by $\tau_\alpha$'s satisfying OS$_4$.

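LEMMA 11.1 If $\alpha \in T(\tau)$, $s \in T(x)$, $\Lambda \in \tilde{F}_\alpha$, then

$$(11.6) \qquad \Lambda\{\tau_\alpha(\omega) \le s\} \in F_s.$$

To prove this relation suppose first that $s \in S_\alpha$, say that s is the $q_0$th
point of $S_\alpha$ according to the enumeration of $S_\alpha$ used in the definition of
the $\tilde{F}_\alpha^{(q)}$'s. Then, if $q \ge q_0$, we shall strengthen the lemma by proving
that (11.6) is true even for $\Lambda \in \tilde{F}_\alpha^{(q)}$. It is sufficient only to consider sets
$\Lambda$ of the form

$$\Lambda = M\{c_1 < \tau_\alpha(\omega) \le c_2\}, \qquad M \in F_{c_2}, \quad c_i \in T(x), \quad c_1 < c_2, \quad s \notin (c_1, c_2),$$

since the sets of this type generate $\tilde{F}_\alpha^{(q)}$. For such a set,

$$\Lambda\{\tau_\alpha(\omega) \le s\} \begin{cases} = \Lambda \in F_s, & \text{if } s \ge c_2, \\ = \text{null set} \in F_s, & \text{if } s \le c_1, \end{cases}$$

as was to be proved. As a second case, suppose that s is not a limit
point of $T(x)$ on the right. (The point may or may not be in $S_\alpha$.) Choose
q so large that

$$\arctan s_1 - \arctan s > \frac{1}{q} \qquad \text{if } s_1 > s, \quad s_1 \in T(x).$$

Then again we strengthen the lemma, proving that (11.6) is true even if
$\Lambda \in \tilde{F}_\alpha^{(q)}$. It is sufficient only to consider sets $\Lambda$ of the same form as above,
except that s may now be in $(c_1, c_2)$. However, according to the
hypothesis on q, if $s \in (c_1, c_2)$, there is no other point of $T(x)$ in this interval to
the right of s. Hence

$$\Lambda\{\tau_\alpha(\omega) \le s\} \begin{cases} = \Lambda \in F_s, & \text{if } s \ge c_2, \\ = \text{null set} \in F_s, & \text{if } s \le c_1, \\ \in F_s, & \text{if } c_1 < s < c_2. \end{cases}$$

Finally suppose that $s \notin S_\alpha$, and that s is a limit point of $T(x)$ on the right.
We prove first that, if $s_1 > s$ and $s_1 \in T(x)$, then

$$(11.6') \qquad M\{\tau_\alpha(\omega) < s\} \in F_{s_1}$$

if

$$M \in \tilde{F}_\alpha^{(q)}, \qquad \frac{1}{q} < \arctan s_1 - \arctan s.$$

It is sufficient only to consider $\tilde{F}_\alpha^{(q)}$ sets M of the form

$$M = M_1\{c_1 < \tau_\alpha(\omega) \le c_2\}, \qquad M_1 \in F_{c_2}, \quad c_i \in T(x), \quad c_1 < c_2.$$

For such a set, our hypothesis on q implies that $s_1 > c_2$ if $s > c_1$, so that

$$M\{\tau_\alpha(\omega) < s\} \begin{cases} = M \in F_{s_1}, & \text{if } s > c_2, \\ = \text{null set} \in F_{s_1}, & \text{if } s \le c_1, \\ \in F_{s_1}, & \text{if } c_1 < s \le c_2. \end{cases}$$

We have thus proved (11.6'). This relation implies

$$M\{\tau_\alpha(\omega) < s\} \in F_s$$

and thereby implies the desired relation (11.6), since, according to the
hypothesis that $s \notin S_\alpha$,

$$P\{\tau_\alpha(\omega) = s\} = 0,$$

so that

$$M\{\tau_\alpha(\omega) = s\} \in F_s.$$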


LEMMA 11.2 Suppose that the $x_t$ process is dominated by a
semi-martingale. Then if $\alpha \in T(\tau)$ and if $s \in T(x)$, the random variable $\tilde{x}_\alpha$ is
integrable on the $\omega$ set $\{\tau_\alpha(\omega) \le s\}$, and the integrability is uniform in $\alpha$.
If s is an $a_j^{(q)}$ for every q, the random variables of the sequence $\{\tilde{x}_\alpha^{(q)}, q \ge 1\}$
are uniformly integrable on the $\omega$ set $\{\tau_\alpha(\omega) \le s\}$, and the degree of
uniformity does not depend on $\alpha$ or on the choice of the $a_j^{(q)}$'s.

We recall that the semi-martingale $\{x_t^+, F_t, t \in T(x)\}$ is said to dominate
$\{x_t, F_t, t \in T(x)\}$ if

$$P\{|x_t(\omega)| \le x_t^+(\omega)\} = 1, \qquad t \in T(x).$$

If the $x_t$ process is a martingale, we can take $x_t^+ = |x_t|$. We can of course
take $x_t^+ = x_t$ if the $x_t$ process is a semi-martingale with non-negative
random variables. More generally, since the positive part of the $x_t$
process, $\{(|x_t| + x_t)/2, F_t, t \in T(x)\}$, is a semi-martingale, the condition
that an $x_t^+$ process exist is a condition on the order of magnitude of the
negative part of $x_t$, $(|x_t| - x_t)/2$. If there is a random variable $z \ge 0$
for which

$$P\{x_t(\omega) \ge -z(\omega)\} = 1, \quad t \in T(x), \qquad E\{z\} < \infty,$$

then an $x_t^+$ semi-martingale which dominates the $x_t$ process can be defined
by

$$x_t^+ = E\{z \mid F_t\} + (|x_t| + x_t)/2.$$

According to Theorem 1.2 (iv), every semi-martingale whose parameter set
is the set of positive integers is dominated by a semi-martingale, so that
in this case an $x_t^+$ process always exists. This fact explains the lack of
mention of an $x_t^+$ process in Theorems 2.1 and 2.2. We remark, however,
that even without the existence of the $x_t^+$ process it is possible to prove
that $|\tilde{x}_\alpha|$ is integrable on the $\omega$ set $\{\tau_\alpha(\omega) \le s\}$, $s \in T(x)$, under the
relatively weak assumption that

$$\operatorname*{G.L.B.}_{t \in T(x)} E\{x_t\} > -\infty.$$

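To prove the lemma, define

$$M_1^{(q)} = \{\tau_\alpha(\omega) \le a_1^{(q)}\}, \qquad M_j^{(q)} = \{a_{j-1}^{(q)} < \tau_\alpha(\omega) \le a_j^{(q)}\}, \quad j > 1.$$

Then, using the semi-martingale property of the $x_t^+$ process,

$$(11.7) \qquad \begin{aligned} \int_{\{|\tilde{x}_\alpha^{(q)}(\omega)| > \lambda,\ \tau_\alpha(\omega) \le s\}} |\tilde{x}_\alpha^{(q)}| \, dP &= \sum_{a_j^{(q)} \le s} \int_{M_j^{(q)}\{|x_{a_j^{(q)}}(\omega)| > \lambda\}} |x_{a_j^{(q)}}| \, dP \\ &\le \sum_{a_j^{(q)} \le s} \int_{M_j^{(q)}\{|x_{a_j^{(q)}}(\omega)| > \lambda\}} x^+_{a_j^{(q)}} \, dP \\ &\le \sum_{a_j^{(q)} \le s} \int_{M_j^{(q)}\{|x_{a_j^{(q)}}(\omega)| > \lambda\}} x_s^+ \, dP \\ &\le \int_{\{\operatorname*{Max}_{t \in S} x_t^+(\omega) > \lambda\}} x_s^+ \, dP, \end{aligned}$$

where S is a finite $T(x)$ set in the interval $(-\infty, s]$, namely the $a_j^{(q)}$'s in
that interval. Now, according to Theorem 2.2,

$$(11.8) \qquad P\Big\{\operatorname*{Max}_{t \in S} x_t^+(\omega) > \lambda\Big\} \le \frac{1}{\lambda} E\{x_s^+\}.$$

Hence

$$\lim_{\lambda \to \infty} P\Big\{\operatorname*{Max}_{t \in S} x_t^+(\omega) > \lambda\Big\} = 0$$

uniformly in S. Hence the left side of (11.7) goes to 0 when $\lambda \to \infty$,
uniformly in $\alpha$, q, and in the choices of the $a_j^{(q)}$'s. When $q \to \infty$, (11.7)
yields, in view of the uniformity of the integrability,

$$(11.9) \qquad \int_{\{|\tilde{x}_\alpha(\omega)| > \lambda,\ \tau_\alpha(\omega) \le s\}} |\tilde{x}_\alpha| \, dP \le \int_{\{\operatorname*{L.U.B.}_{t \in U} x_t^+(\omega) > \lambda\}} x_s^+ \, dP,$$

where U is a certain finite or enumerably infinite $T(x)$ set in the interval
$(-\infty, s]$. Since (11.8) holds for finite sets, it must hold for enumerably
infinite sets. Hence the probability of the domain of integration on the
right in (11.9) goes to 0, when $\lambda \to \infty$, uniformly in U, and this implies
the truth of the lemma.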

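LEMMA 11.3 Suppose that the $x_t$ process is dominated by a
semi-martingale. Let $\alpha, \beta \in T(\tau)$, $\alpha < \beta$, $s \in T(x)$, $\Lambda \in \tilde{F}_\alpha$. Then, if the $x_t$
process is a semi-martingale,

$$(11.10) \qquad \int_{\Lambda\{\tau_\alpha(\omega) \le s\}} \tilde{x}_\alpha \, dP \le \int_{\Lambda\{\tau_\beta(\omega) \le s\}} \tilde{x}_\beta \, dP + \int_{\Lambda\{\tau_\alpha(\omega) \le s,\ \tau_\beta(\omega) > s\}} x_s \, dP,$$

$$(11.11) \qquad \int_{\Lambda\{\tau_\alpha(\omega) \le s\}} \tilde{x}_\alpha \, dP \le \int_{\Lambda\{\tau_\alpha(\omega) \le s\}} x_s \, dP.$$

If the $x_t$ process is a martingale, these inequalities become equalities.

Suppose that the $x_t$ process is a semi-martingale, and choose $a_j^{(q)}$'s to
match both $\tau_\alpha$ and $\tau_\beta$, so that both $\tilde{x}_\alpha^{(q)}$ and $\tilde{x}_\beta^{(q)}$ are now defined, and

$$\lim_{q \to \infty} \tilde{x}_\alpha^{(q)} = \tilde{x}_\alpha, \qquad \lim_{q \to \infty} \tilde{x}_\beta^{(q)} = \tilde{x}_\beta$$

with probability 1. Define

$$\begin{aligned} \Lambda_j^{(q)} &= \Lambda\{a_{j-1}^{(q)} < \tau_\alpha(\omega) \le a_j^{(q)}\}, \\ \Lambda_{jk}^{(q)} &= \Lambda_j^{(q)}\{a_{k-1}^{(q)} < \tau_\beta(\omega) \le a_k^{(q)}\}, \qquad k \ge j, \\ M_{jk}^{(q)} &= \Lambda_j^{(q)}\{\tau_\beta(\omega) > a_k^{(q)}\} = \Lambda_j^{(q)} - \bigcup_{i=j}^{k} \Lambda_{ji}^{(q)}, \qquad k \ge j. \end{aligned}$$

Then, by Lemma 11.1,

$$\Lambda_j^{(q)} \in F_{a_j^{(q)}}, \qquad \Lambda_{jk}^{(q)} \in F_{a_k^{(q)}}, \qquad M_{jk}^{(q)} \in F_{a_k^{(q)}},$$

so that, using the semi-martingale property of the $x_t$ process,

$$(11.12) \qquad \begin{aligned} \int_{\Lambda_j^{(q)}} \tilde{x}_\alpha^{(q)} \, dP &= \int_{\Lambda_j^{(q)}} x_{a_j^{(q)}} \, dP \\ &= \int_{\Lambda_{jj}^{(q)}} x_{a_j^{(q)}} \, dP + \int_{M_{jj}^{(q)}} x_{a_j^{(q)}} \, dP \\ &\le \int_{\Lambda_{jj}^{(q)}} x_{a_j^{(q)}} \, dP + \int_{M_{jj}^{(q)}} x_{a_{j+1}^{(q)}} \, dP \\ &= \int_{\Lambda_{jj}^{(q)}} x_{a_j^{(q)}} \, dP + \int_{\Lambda_{j,j+1}^{(q)}} x_{a_{j+1}^{(q)}} \, dP + \int_{M_{j,j+1}^{(q)}} x_{a_{j+1}^{(q)}} \, dP \\ &\le \sum_{k=j}^{N} \int_{\Lambda_{jk}^{(q)}} x_{a_k^{(q)}} \, dP + \int_{M_{jN}^{(q)}} x_{a_N^{(q)}} \, dP, \qquad N \ge j, \\ &= \int_{\Lambda_j^{(q)}\{\tau_\beta(\omega) \le a_N^{(q)}\}} \tilde{x}_\beta^{(q)} \, dP + \int_{M_{jN}^{(q)}} x_{a_N^{(q)}} \, dP. \end{aligned}$$

Now suppose that, for each q, s is some $a_j^{(q)}$. Then, choosing N above
so that $a_N^{(q)} = s$, and summing over $j \le N$, we find

$$\int_{\Lambda\{\tau_\alpha(\omega) \le s\}} \tilde{x}_\alpha^{(q)} \, dP \le \int_{\Lambda\{\tau_\beta(\omega) \le s\}} \tilde{x}_\beta^{(q)} \, dP + \int_{\Lambda\{\tau_\alpha(\omega) \le s,\ \tau_\beta(\omega) > s\}} x_s \, dP.$$

According to Lemma 11.2 the integrands are uniformly (in q) integrable
over the indicated integration sets. Hence when $q \to \infty$ we obtain
(11.10). To obtain (11.11), apply the semi-martingale inequality to the
first line of (11.12) to obtain

$$\int_{\Lambda_j^{(q)}} \tilde{x}_\alpha^{(q)} \, dP \le \int_{\Lambda_j^{(q)}} x_s \, dP.$$

Summing over $j \le N$, and letting $q \to \infty$ yields (11.11). In the martingale
case, all the above inequalities become equalities, so that there is then
equality in (11.10) and (11.11).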

THEOREM 11.6 Suppose that the $x_t$ process is dominated by a
semi-martingale, and that

$$\operatorname*{L.U.B.}_{t \in T(x)} t = b \in T(x).$$

Then, if the process $\{x_t, F_t, t \in T(x)\}$ is a semi-martingale [martingale], the
process $\{\tilde{x}_\alpha, \tilde{F}_\alpha, \alpha \in T(\tau)\}$ is also a semi-martingale [martingale], with

$$(11.13) \qquad \operatorname*{G.L.B.}_{t \in T(x)} E\{x_t\} \le E\{\tilde{x}_\alpha\} \le E\{x_b\}.$$

In the martingale case, this inequality becomes an equality.

By Lemma 11.2 with $s = b$, $E\{|\tilde{x}_\alpha|\} < \infty$. If $s = b$ in (11.10), we obtain

$$\int_\Lambda \tilde{x}_\alpha \, dP \le \int_\Lambda \tilde{x}_\beta \, dP, \qquad \Lambda \in \tilde{F}_\alpha, \quad \alpha < \beta.$$

This is the semi-martingale inequality for the $\tilde{x}_\alpha$ process, and according
to Lemma 11.3 there is equality if the $x_t$ process is a martingale, so
that in this case the $\tilde{x}_\alpha$ process is also a martingale. If $\Lambda = \Omega$, and
$s = b$ in (11.11), we obtain the right-hand half of (11.13), with equality
in the martingale case. Since the first and third terms of (11.13) are
equal in the martingale case, there is equality throughout in that case.
There remains the proof of the left half of (11.13) in the semi-martingale
case. If $T(x)$ has a minimum value a, the proof of the left half of (11.13)
follows that of the corresponding inequality in the discrete parameter case
(in which a = 1) given in the proof of Theorem 2.2, and will be omitted.
We now prove that we can always assume that $T(x)$ has a minimum value,
because, if there is no such value initially, one can be adjoined, and the
truth of the inequalities in question for the extended processes will imply
their truth for the original ones. We can assume that $T(x)$ is bounded,
making a bounded monotone transformation of this set, if necessary, to
insure the boundedness. Suppose then that $T(x)$ has no minimum value,
and define $a = \operatorname*{G.L.B.}_{t \in T(x)} t$. By hypothesis the $x_t$ process is dominated by
a (non-negative) semi-martingale, whose random variables are uniformly
integrable for $t \le t_0 \in T(x)$ ($t_0$ fixed) according to Theorem 3.1 (iii).
Then the $x_t$'s are also uniformly integrable for $t \le t_0$, so that

$$\lim_{t \downarrow a} E\{x_t\} > -\infty.$$

There are then, according to Theorem 11.2 (i), random variables which
we shall denote by $x_{a+}$, $x^+_{a+}$, such that

$$\lim_{t \downarrow a} x_t = x_{a+}, \qquad \lim_{t \downarrow a} x_t^+ = x^+_{a+}$$

with probability 1, if $t \downarrow a$ along a parameter sequence. We adjoin the
point $a - 1$ to $T(x)$, and define

$$x_{a-1} = x_{a+}, \qquad x^+_{a-1} = x^+_{a+}, \qquad F_{a-1} = \bigcap_{t \in T(x)} F_t,$$

thus obtaining the desired new parameter set with a minimum value. We
chose $a - 1$ here instead of a to insure that OS$_3$ would be verified
(vacuously) at the new parameter value.

THEOREM 11.7 Suppose that the $x_t$ process is dominated by a semi-martingale. Then

\[ E\{|\tilde{x}_\alpha|\} \le 3 \operatorname{L.U.B.}_{t \in T(x)} E\{|x_t|\}, \tag{11.14} \]

and the factor 3 can be replaced by 1 if the $x_t$'s are non-negative.


We can assume that $T(x)$ is bounded in proving this theorem, making a bounded monotone transformation of this set to insure boundedness, if necessary. Let $L$ be the L.U.B. on the right in (11.14). Let $t \in T(x)$, and define $\tau_s'$ by

\[ \tau_s'(\omega) = \begin{cases} \operatorname{Min}[s, \tau_\alpha(\omega)], & s \le t,\ s \in T(x), \\ t, & s = t + 1. \end{cases} \]

Let

\[ T(x)' = T(x) \cap [-\infty, t], \]

and let $T(\tau)'$ consist of the points of $T(x)'$ and in addition the point $t + 1$.
Then the family {r,, F, s¢T(z)'} satisfies OS, Let the process 
$\{x_s, s \in T(x)'\}$ go into the process $\{\tilde{x}_s', s \in T(\tau)'\}$ under the optional sampling determined by the $\tau_s'$ family. Since $T(x)'$ has a last element $t$, the $\tilde{x}_s'$ process is a semi-martingale, according to Theorem 11.6. Hence (11.13) is satisfied for this optional sampling transformation, yielding

\[ E\{\tilde{x}_s'\} \ge \operatorname{G.L.B.}_{s \in T(x)} E\{x_s\} \ge -L. \]


$11 CONTINUOUS PARAMETER MARTINGALES 375 


According to Theorem 3.1 (ii),

\[ E\{|\tilde{x}_s'|\} \le -E\{\tilde{x}_s'\} + 2E\{|\tilde{x}_{t+1}'|\} = -E\{\tilde{x}_s'\} + 2E\{|x_t|\} \le 3L. \]

In view of the definition of $\tau_s'$, this inequality for $s = t$ is equivalent to

\[ \int_{\{\tau_\alpha(\omega) \le t\}} |\tilde{x}_\alpha| \, dP \le 3L, \]

which yields (11.14) when $t$ increases. As the proof shows, the bound $3L$ can be replaced by

\[ 2L - \operatorname{G.L.B.}_{s \in T(x)} E\{x_s\}. \]

In particular, if the $x_t$'s are non-negative, (11.13) yields

\[ \int_{\{\tau_\alpha(\omega) \le t\}} \tilde{x}_\alpha \, dP \le E\{\tilde{x}_t'\} \le E\{x_t\} \le L. \]

This inequality yields (11.14) with $L$ on the right instead of $3L$, when $t$ increases. The device used here could have been used to deduce the right-hand half of (11.13) above.
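LEMMA 11.4 Suppose that the $x_t$ process is dominated by a semi-martingale. Then if $\alpha \in T(\tau)$, $s_1, s_2 \in T(x)$, $s_1 < s_2$, $\Lambda \in \mathcal{F}_{s_1}$,

\[ \int_{\Lambda\{\tau_\alpha(\omega) > s_1\}} x_{s_1} \, dP \le \int_{\Lambda\{s_1 < \tau_\alpha(\omega) \le s_2\}} \tilde{x}_\alpha \, dP + \int_{\Lambda\{\tau_\alpha(\omega) > s_2\}} x_{s_2} \, dP. \]

To prove this lemma define the random variables $\tau'$, $\tau''$ by

\[ \tau'(\omega) = s_1; \qquad \tau''(\omega) = \begin{cases} s_1, & \text{if } \tau_\alpha(\omega) \le s_1, \\ \tau_\alpha(\omega), & \text{if } s_1 < \tau_\alpha(\omega) \le s_2, \\ s_2, & \text{if } \tau_\alpha(\omega) > s_2. \end{cases} \]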


The family of random variables consisting of the two random variables 
x’, 7” satisfies OS, Let the process {to Fn $ € [s1 Sq]7(x)} go into the 
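process consisting of the two random variables $x_{s_1}$, $\tilde{x}''$ (where $\tilde{x}'' = x_{\tau''}$) under the optional sampling determined by $\tau'$, $\tau''$. Since the parameter set of the original process has a last element $s_2$, the process $\{x_{s_1}, \tilde{x}''\}$ is a semi-martingale, according to Theorem 11.6, and we obtain, applying the semi-martingale inequality,

\[ \int_{\Lambda\{\tau_\alpha(\omega) > s_1\}} x_{s_1} \, dP \le \int_{\Lambda\{\tau_\alpha(\omega) > s_1\}} \tilde{x}'' \, dP = \int_{\Lambda\{s_1 < \tau_\alpha(\omega) \le s_2\}} \tilde{x}_\alpha \, dP + \int_{\Lambda\{\tau_\alpha(\omega) > s_2\}} x_{s_2} \, dP, \]

as was to be proved.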


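THEOREM 11.8 Suppose that the $x_t$ process is dominated by a semi-martingale $\{x_t^*, \mathcal{F}_t, t \in T(x)\}$ and that

\[ b = \operatorname{L.U.B.}_{t \in T(x)} t \notin T(x). \]

(i) If the $x_t$ process is a semi-martingale, if

\[ E\{|\tilde{x}_\alpha|\} < \infty, \qquad \alpha \in T(\tau), \tag{11.15} \]

and if

\[ \liminf_{s \to b} \int_{\{\tau_\alpha(\omega) > s\}\{x_s(\omega) > 0\}} x_s \, dP = 0, \qquad \alpha \in T(\tau), \tag{11.16} \]

it follows that the $\tilde{x}_\alpha$ process is also a semi-martingale, with

\[ \operatorname{G.L.B.}_{t \in T(x)} E\{x_t\} \le E\{\tilde{x}_\alpha\}, \qquad \alpha \in T(\tau). \tag{11.17} \]

(ii) Under (11.15) and

\[ \liminf_{s \to b} \int_{\{\tau_\alpha(\omega) > s\}} |x_s| \, dP = 0, \qquad \alpha \in T(\tau), \tag{11.16'} \]

(11.17) can be strengthened to

\[ \operatorname{G.L.B.}_{t \in T(x)} E\{x_t\} \le E\{\tilde{x}_\alpha\} \le \operatorname{L.U.B.}_{t \in T(x)} E\{x_t\}, \qquad \alpha \in T(\tau), \tag{11.17'} \]

and it then follows that, if the $x_t$ process is a martingale, the $\tilde{x}_\alpha$ process is also a martingale, with equality in (11.17').

(iii) Each of the following conditions implies the validity of (11.15) and (11.16').

$C_1$ The $x_t$'s are uniformly integrable.

$C_2$ Each $\tau_\alpha$ is bounded from above (with probability 1) by a value in $T(x)$.

$C_3$ There is a constant $K > 0$ with the following properties. The parameter set $T(x)$ contains the integers $\ge K$ (but not $b = \infty$). For each integer $n \ge K$, and each $\alpha \in T(\tau)$,

\[ E\{x_{n+1}^* - x_n^* \mid \mathcal{F}_n\} \le K \quad \text{for } n < \tau_\alpha(\omega) \tag{11.18} \]

with probability 1. Moreover,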
(11.19) E{|7,| + Ta} < 0, a e T(r). 


(iv) The following condition $C_4$ [$C_4'$] implies the validity of (11.15) and (11.16) [(11.16')].

$C_4$ (11.15) is true and there are a random variable $z \ge 0$, with $E\{z\} < \infty$, and a sequence $t_1 < t_2 < \cdots$, $t_i \in T(x)$, $t_i \to b$, such that


(11.20) P{|x,(o)| Sa,(o)—eo)}=1, 1t>t,i>1. 




$C_4'$ The $x_t$ process is a martingale, (11.15) is true, and there are a random variable $z \ge 0$, with $E\{z\} < \infty$, and a sequence $t_1 < t_2 < \cdots$, $t_i \in T(x)$, $t_i \to b$, such that
(1.20) Pilx(o)| > |x,(o)|—2@)}=1, t>tpi21. 

Proofs of (i), (ii) Although we could use the result of Theorem 11.6 to prove (i), by means of the device used in the proof of Lemma 11.3, we do not do so, because we have already set up the necessary inequalities. In fact, according to Lemma 11.3, if $s \in T(x)$, $\alpha, \beta \in T(\tau)$, $\alpha < \beta$, $\Lambda \in \tilde{\mathcal{F}}_\alpha$, and if the $x_t$ process is a semi-martingale,


(11.21) | #ap=z [ 4cr+ E 
Aft, () <8} A(r,(@) <s} ATH Za 
= #dP+ f «dP. 
Altg(w) Ss} {ao 
When $s \to b$ we obtain, under (11.15) and (11.16),

\[ \int_\Lambda \tilde{x}_\alpha \, dP \le \int_\Lambda \tilde{x}_\beta \, dP, \]

so that the $\tilde{x}_\alpha$ process is a semi-martingale. Inequality (11.17) is proved in precisely the same way as the left half of (11.13). Under (11.15) and (11.16') the right-hand half of (11.17') follows from (11.11) with $\Lambda = \Omega$, when $s \to b$. This finishes the proof of (i), (ii) in the semi-martingale case. In the martingale case the first line of (11.21) is an equality, and leads, when $s \to b$, to the martingale equality for the $\tilde{x}_\alpha$ process. In the martingale case the extreme terms of (11.17') are equal.

Proof of (iii) If the $x_t$'s are uniformly integrable, that is, if $C_1$ is satisfied, then $E\{|x_t|\}$ is bounded in $t$, and then $E\{|\tilde{x}_\alpha|\} < \infty$ by Theorem 11.7. Thus (11.15) is satisfied. The limit relation (11.16') is also true in this case, since as $s \to b$ the domain of integration in (11.16') decreases to a set of probability 0. If each $\tau_\alpha$ is bounded from above with probability 1 by a value in $T(x)$, that is, if $C_2$ is satisfied, (11.15) is true according to Lemma 11.2, and (11.16') is also true because in this case the limit inferior in (11.16') is actually attained for $s$ near $b$. (The whole treatment of $C_2$ can trivially be reduced to the case discussed in Theorem 11.6.)

If $C_3$ is satisfied, set

\[ y_K = x_K^*, \qquad y_j = x_j^* - x_{j-1}^*, \quad j > K. \]

Here we have supposed that $K$ is an integer. If it is not, it can be replaced by any larger number that is. Define




wo) =2;(o), if j-1<7,.o)<j, j>K, 
eo if 7,(w) < K. 
Then 
oO 
E{w}= > zj dP 
E+ G_1<7, (0) <i} 
n 


=lim > f  @xttyeut--++9)4P 


nso K+1 {j-1<7,(0) <j} 
n 

= f agtap+timt> f yaP 

{K<7,(0) <n) noe KHI (o)>9-1} 

= f tag) dP] 
fralo) >n} 
< Bfex*} + lim sup 5 | y;dP. 
Etl (o> 5-1) 
a’ 


Now the $\omega$ set $\{\tau_\alpha(\omega) > j - 1\}$ is an $\mathcal{F}_{j-1}$ set by hypothesis. Hence we can use condition $C_3$ to continue this inequality, obtaining


E(w} < Efe} + Š KP no) Syn 


< Efex} + KE{|7,| + Ta} < 00. 
Now suppose that /— 1 and j are a,‘®’s. Then 
oldes > f taot dP 
{j-1<ty(w) <5} JIa] fa (0 <T (w) <a} 
< xt dP. 
G-1<7,(0) <i) 
When q —> œ, this inequality becomes, in view of the uniformity (in q) of 
the integrability of the integrand on the left, 
ar fata. 
G-1<7,(0) <3} {j-1<7,(0) <j} 
Thus we obtain, if s is an ee ands > K, 


eea [lar 
{r,(0)>8} EHI §_1<3,(0) <i} 
% 
<< aj+ dP 
EFI Gj_1<7,(0) <5} 
= f wdP. 


{7,(0)>s8} 




Since $|\tilde{x}_\alpha|$ is already known to have a finite integral on $\{\tau_\alpha(\omega) \le s\}$, from Lemma 11.2, we have now proved that (11.15) is true under condition $C_3$. Since the last integral in the preceding inequality goes to 0 when $s \to \infty$, (11.16') is also true.

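Proof of (iv) Under $C_4$, (11.15) is satisfied by hypothesis. To prove (11.16) we note that, under $C_4$,

\[ |\tilde{x}_\alpha(\omega)| \ge x_{t_i}(\omega) - z(\omega) \quad \text{if } \tau_\alpha(\omega) > t_i, \]

with probability 1. It then follows that, if $i$ is fixed and $s < t_i$, $s \in T(x)$,

\begin{align*}
\int_{\{\tau_\alpha(\omega) > s\}\{x_s(\omega) > 0\}} x_s \, dP
&\le \int_{\{s < \tau_\alpha(\omega) \le t_i\}\{x_s(\omega) > 0\}} |\tilde{x}_\alpha| \, dP + \int_{\{\tau_\alpha(\omega) > t_i\}\{x_s(\omega) > 0\}} x_{t_i} \, dP \\
&\le \int_{\{s < \tau_\alpha(\omega) \le t_i\}} |\tilde{x}_\alpha| \, dP + \int_{\{\tau_\alpha(\omega) > t_i\}} (|\tilde{x}_\alpha| + z) \, dP \\
&\le \int_{\{\tau_\alpha(\omega) > s\}} (|\tilde{x}_\alpha| + z) \, dP \to 0 \qquad (s \to b). \tag{11.22}
\end{align*}

Thus (11.16) is satisfied. Finally under $C_4'$ the argument just given can be applied to $|x_t|$ instead of $x_t$, to give the desired conclusion.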

The proof of the theorem has now been completed. In the applications the field $\mathcal{F}_t$ is ordinarily taken as the field of $\omega$ sets determined by conditions on the $x_s$'s for $s \le t$, and this will be understood below unless other definitions are given.


Note added in proof. In Theorems 11.6, 11.7, 11.8, the condition A that the $x_t$ process be dominated by a semi-martingale is not a restriction on the $x_t$ process in the martingale case, but is apparently one if this process is a semi-martingale. This restriction can be eliminated as follows. If the $x_t$ process is a semi-martingale, and if $u_{nt} = \max(x_t, n)$, where $n$ is a negative integer, the $u_{nt}$ process is also a semi-martingale, and

\[ |u_{nt}| \le -n + \max(x_t, 0). \]

Thus the $u_{nt}$ process is dominated by the semi-martingale defined by the right side of this inequality. Theorems 11.6, 11.7, 11.8 can now be applied to the $u_{nt}$ process, and yield corresponding theorems for the $x_t$ process when $n \to -\infty$. In this way it is found that Theorem 11.7 is true without A, and that Theorem 11.6 is true if A is replaced by the weaker condition $A_0$ that $\operatorname{G.L.B.}_t E\{x_t\} > -\infty$. If A is dropped in the statement of Theorem 11.8,
the following other changes should be made: condition A, should be made a part of (iii) Cy; 
in (iii) $C_3$, $x_j^*$ should be interpreted as $\max(x_j, 0)$, for $j = n, n + 1$, $A_0$ should be made a part of the condition, and the thus modified condition then implies the validity of (11.15) and (11.16) [not of (11.16')]. Note that, in the discussion of (iv) $C_4$, it is now necessary to set


s=l 

The definition we have given of the optional sampling transformation 
is the useful one for most purposes. However, it can be modified in 
various ways. For example, each S, could be arbitrarily augmented by 
an at most enumerable $T(x)$ set, thus modifying the definition of $\tilde{x}_\alpha$ in (11.5). This modification would require no change in either the results we have obtained or their proofs.




As an application of optional sampling, we discuss the extension of a discrete parameter sequential analysis result obtained in §10. Let $\{y_t, 0 \le t < \infty\}$ be a separable stochastic process with stationary independent increments, and suppose that $y_0 = 0$. The following regularity hypotheses are made. We shall see in VIII that they are largely implied by the qualitative description of the process just given.

$SA_1$ Almost all sample functions of the $y_t$ process are continuous on the right at all parameter values.

$SA_2$ $E\{|y_t|\} < \infty$ for all $t$, and there is a constant $a$ such that

\[ E\{y_t\} = at, \qquad t \ge 0. \]


Let 7 be a random variable satisfying the condition OS,’ for a function 
which determines an optional stopping procedure. We wish to evaluate 
$E\{y_\tau\}$, and it is natural to conjecture that

\[ E\{y_\tau\} = aE\{\tau\}. \]

From now on we therefore impose the additional restriction that $E\{\tau\}$ exists. To derive the desired equation we proceed just as in the discrete parameter case of §10. Let $x_t = y_t - at$. Then the $x_t$ process is a martingale with independent increments. We shall apply Theorem 11.8 to this process. The set $T(\tau)$ in the present application consists of only a single point, and the corresponding $\tau_\alpha$ is simply $\tau$. Then condition $C_3$ is satisfied, with

\[ K = E\{|x_1|\}. \]

According to Theorem 11.8,

\[ E\{\tilde{x}_\alpha\} = E\{x_t\}, \]

that is, in the present case,

\[ 0 = E\{x_\tau\} = E\{y_\tau - a\tau\} = E\{y_\tau\} - aE\{\tau\}, \]

and this equation gives the desired evaluation of $E\{y_\tau\}$.
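The identity $E\{y_\tau\} = aE\{\tau\}$ can be checked exactly in a discrete parameter analogue. The following is only a sketch under assumptions that are not in the text: the process is replaced by a walk with independent steps $\pm 1$ (probability `p` of a step $+1$, so that $a = 2p - 1$), and $\tau$ is the first exit time from an interval, truncated at a fixed `horizon` so that it is bounded. The function name and parameters are illustrative.

```python
from fractions import Fraction


def stopped_walk_moments(p, lower, upper, horizon):
    """Exact (E{S_tau}, E{tau}) for a +/-1 walk started at 0.

    Assumption of this sketch: tau = min(first time the walk is <= lower
    or >= upper, horizon), with lower < 0 < upper, so tau is bounded.
    """
    q = 1 - p
    alive = {0: Fraction(1)}      # position -> probability, paths not yet stopped
    mean_value = Fraction(0)      # accumulates E{S_tau}
    mean_tau = Fraction(0)        # accumulates E{tau}
    for step in range(1, horizon + 1):
        moved = {}
        for pos, prob in alive.items():
            moved[pos + 1] = moved.get(pos + 1, Fraction(0)) + prob * p
            moved[pos - 1] = moved.get(pos - 1, Fraction(0)) + prob * q
        alive = {}
        for pos, prob in moved.items():
            if pos <= lower or pos >= upper or step == horizon:
                mean_value += prob * pos   # this mass stops now, at position pos
                mean_tau += prob * step
            else:
                alive[pos] = prob
    return mean_value, mean_tau


p = Fraction(3, 5)
drift = 2 * p - 1                 # a = expectation of one step
value, tau_mean = stopped_walk_moments(p, lower=-3, upper=4, horizon=25)
print(value, drift * tau_mean)    # the two numbers agree exactly
```

Rational arithmetic is used so that the two sides can be compared with `==` rather than up to rounding.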

Example 4 Let $\{x_n, n \ge 0\}$ be a martingale, and define $\tau_t$ as the largest integer $\le t$, for $0 \le t < \infty$. Then the family of random variables $\{\tau_t, 0 \le t < \infty\}$ (all the random variables are of course constants) determines a system of optional sampling, and the $\tilde{x}_t$ process obtained by the sampling is given by

\[ \tilde{x}_t = x_n, \qquad n \le t < n + 1, \]

so that the new martingale becomes that of Example 1. In this case $T(x)$ is the set of non-negative integers and $T(\tau)$ the set of non-negative real numbers.

Example 5 Let $\{x_t, 0 \le t < \infty\}$ be a separable Brownian motion process (see Example 2) with $x_0 = 0$. It will be proved in VIII that almost all sample functions of this process are everywhere continuous. For each $\omega$ corresponding to a continuous sample function let $\tau(\omega)$ be the first parameter value at which the sample function takes on the value
d. Here d is a non-zero constant, to be held fast in the following dis- 
cussion. The random variable 7 satisfies the condition OS,’ and can 
therefore be used to define a system of optional stopping, yielding an #, 
process with

\[ \tilde{x}_t(\omega) = \begin{cases} x_t(\omega), & t \le \tau(\omega), \\ x_{\tau(\omega)}(\omega), & \tau(\omega) < t. \end{cases} \]

According to Theorem 11.8 the $\tilde{x}_t$ process is a martingale, with

\[ E\{\tilde{x}_t\} = 0, \]

even though we have now of course lost the symmetry of the $x_t$ distribution. Continuing the discussion of this example, we now show that $E\{\tau\} = \infty$. To do this, define a system of optional sampling, with $T(\tau)$ containing only a single point, and let the corresponding $\tau_\alpha$ be $\tau$. Then the hypotheses of Theorem 11.8 cannot all be satisfied, for otherwise we would have

\[ 0 = E\{x_0\} = E\{\tilde{x}_\alpha\} = E\{x_\tau\} = d \ne 0, \]

because $x_\tau = d$ by definition of $\tau$. In particular, Condition $C_3$ of Theorem 11.8 cannot be satisfied. However, (11.18) is satisfied with

\[ K = E\{|x_1|\}. \]

Hence (11.19), the last part of this condition, cannot be satisfied, that is, $E\{\tau\} = \infty$.
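The two conclusions of Example 5 (the stopped process keeps expectation 0, yet $E\{\tau\} = \infty$) have an exact discrete parameter analogue. The sketch below assumes, as an illustration only, a symmetric $\pm 1$ random walk in place of Brownian motion and an integer level `d`; neither is part of the example itself. It computes $E\{x_{\min(\tau, n)}\}$ and $E\{\min(\tau, n)\}$ in rational arithmetic.

```python
from fractions import Fraction


def stopped_walk_summary(d, horizon):
    """Exact (E{S at time min(tau, horizon)}, E{min(tau, horizon)}).

    Assumptions of this sketch: S is a symmetric +/-1 walk from 0 and
    tau is the first time S equals the non-zero integer d.
    """
    half = Fraction(1, 2)
    alive = {0: Fraction(1)}      # position -> probability, level d not yet hit
    mean_position = Fraction(0)
    mean_time = Fraction(0)
    for step in range(1, horizon + 1):
        moved = {}
        for pos, prob in alive.items():
            for new_pos in (pos - 1, pos + 1):
                moved[new_pos] = moved.get(new_pos, Fraction(0)) + prob * half
        absorbed = moved.pop(d, Fraction(0))   # mass that hits d at this step
        mean_position += absorbed * d
        mean_time += absorbed * step
        alive = moved
    for pos, prob in alive.items():            # paths still running at the horizon
        mean_position += prob * pos
        mean_time += prob * horizon
    return mean_position, mean_time


for n in (50, 200):
    mean_position, mean_time = stopped_walk_summary(1, n)
    print(n, mean_position, float(mean_time))
```

The stopped expectation is exactly 0 for every horizon, while the truncated expected stopping time keeps growing (roughly like the square root of the horizon), which is the discrete counterpart of $E\{\tau\} = \infty$.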

Example 6 Let $\{y_t, 0 \le t < \infty\}$ be a separable Poisson process, with

\[ y_0 = 0, \qquad E\{y_t\} = ct. \]

It is shown in VIII that the sample functions of the $y_t$ process are almost all continuous except for unit jumps. For each $\omega$ corresponding to such a sample function let $\tau(\omega)$ be the parameter value where the sample function has its first jump. Then $\tau$ has density of distribution $ce^{-cs}$, $s > 0$, and

\[ E\{\tau\} = \frac{1}{c}. \]

Define

\[ x_t = y_t - ct. \]

Then the $x_t$ process is a martingale. We shall illustrate Theorem 11.8 by an example in which we can make explicit calculations. Define a system of optional sampling in which $T(\tau)$ contains only a single point, with corresponding $\tau_\alpha = \tau$, so that $\tilde{x}_\alpha = x_\tau$. Then

\[ \tilde{x}_\alpha = x_\tau = 1 - c\tau. \]

Theorem 11.8 is applicable (using Condition $C_3$), and we find that

\[ 0 = E\{x_0\} = E\{\tilde{x}_\alpha\} = E\{1 - c\tau\}. \]

This equation checks our previous evaluation of $E\{\tau\}$. The example also illustrates the continuous parameter version of a sequential analysis result discussed above.
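The explicit calculation of Example 6 can also be reproduced numerically. The sketch below integrates against the density $ce^{-cs}$ with a composite Simpson rule; the cutoff of the integration range, the step count, the value of `c`, and the function name are choices made for this illustration only.

```python
import math


def exponential_expectation(g, c, cutoff_in_means=60.0, steps=20_000):
    """Approximate E{g(tau)} when tau has density c*exp(-c*s), s > 0.

    Composite Simpson rule on [0, cutoff_in_means / c]; `steps` must be even.
    The neglected tail beyond the cutoff is of order exp(-cutoff_in_means).
    """
    upper = cutoff_in_means / c
    h = upper / steps

    def integrand(s):
        return g(s) * c * math.exp(-c * s)

    total = integrand(0.0) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0


c = 2.5
mean_tau = exponential_expectation(lambda s: s, c)                # close to 1/c
mean_stopped = exponential_expectation(lambda s: 1.0 - c * s, c)  # close to 0
print(mean_tau, mean_stopped)
```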

[Theorem 3.1s (iv)] Let $\{x_t, a < t < b\}$ be a separable semi-martingale, and suppose that almost all sample functions of the process are continuous. Then $\lim_{t \to b} x_t$ exists and is finite almost everywhere where $\limsup_{t \to b} x_t < \infty$.

Before proving this, we remark that by Theorem 11.5 the $\omega$ sets

\[ \{\operatorname{L.U.B.}_t x_t(\omega) < \infty\}, \qquad \{\limsup_{t \to b} x_t(\omega) < \infty\} \]

differ by at most a set of probability 0. Let $\tau(\omega)$ be the first parameter value (if any) at which $x_t(\omega) = d$, where $d$ is a constant, and define a system of optional stopping based on $\tau$. Then the $\tilde{x}_t - d$ process has non-positive random variables, and is a semi-martingale. Hence, according to the general parameter set version of Theorem 4.1s (i), $\lim_{t \to b} \tilde{x}_t$ exists and is finite with probability 1, that is, $\lim_{t \to b} x_t(\omega)$ exists and is finite almost everywhere that $\operatorname{L.U.B.}_t x_t(\omega) < d$, that is, almost everywhere that $\operatorname{L.U.B.}_t x_t(\omega) < \infty$, since $d$ is arbitrary. This completes the proof. Essentially the same reasoning would derive the same conclusion if the hypothesis of continuity of the sample functions were replaced by the hypothesis

\[ E\{\operatorname{L.U.B.}_t (x_t - x_{t-})\} < \infty. \]

(The measurability of the indicated L.U.B. would also have to be assumed.)

We now make a slight digression into discrete parameter martingale theory. Let $y_1, \cdots, y_n$ be real random variables, with

\[ x_k = \sum_{j=1}^{k} y_j, \]

satisfying the conditions

\[ E\{y_1\} = 0, \qquad E\{y_j \mid y_1, \cdots, y_{j-1}\} = 0, \quad j > 1, \tag{11.23} \]

and

\[ E\{y_1^2\} = \sigma_1^2, \qquad E\{y_j^2 \mid y_1, \cdots, y_{j-1}\} = \sigma_j^2, \quad j > 1, \tag{11.24} \]

with probability 1. Here $\sigma_1, \cdots, \sigma_n$ are non-negative constants. Define

\[ s_k^2 = \sum_{j=1}^{k} \sigma_j^2. \]

We have already remarked that (11.23) simply states that the process $\{x_j, j \le n\}$ is a martingale, with

\[ E\{x_j\} = 0. \]

Let $\mathcal{F}_j$ be the Borel field of the $\omega$ sets determined by conditions on $x_1, \cdots, x_j$, that is, on $y_1, \cdots, y_j$. Then (11.23) and (11.24) imply

\[ E\{x_j^2 - s_j^2 \mid \mathcal{F}_{j-1}\} = x_{j-1}^2 - s_{j-1}^2, \qquad j > 1, \]

with probability 1. In other words, if (11.23) and (11.24) are true, the process $\{x_j^2 - s_j^2, \mathcal{F}_j, j \le n\}$ is also a martingale. (One way of looking at this fact is to observe that the $x_j^2$ process is a semi-martingale, since the $x_j$ process is a martingale, and in the representation (1.5') of this semi-martingale we have the case in which the $\Delta_j$'s reduce to constants.) Conversely, if (11.23) is true, and if $\{x_j^2 - s_j^2, \mathcal{F}_j, j \le n\}$ is a martingale, then (11.24) is true. In particular, (11.24) is true if the $y_j$'s are mutually independent, with zero expectations and finite variances. Now in this case the central limit theorem is applicable (see III §4), and shows, under appropriate further restrictions, essentially that $\operatorname{Max}_j |y_j|$ is small, that $x_n$ has an approximately Gaussian distribution. The characteristic function proof is simply that, if $\Phi_j$ is the characteristic function of $y_j$ and $\Phi$ that of $x_n$,

\[ \log \Phi_j(\lambda) = -\frac{\sigma_j^2 \lambda^2}{2} \]

(approximately) so that

\[ \log \Phi(\lambda) = \sum_{j=1}^{n} \log \Phi_j(\lambda) = -\frac{s_n^2 \lambda^2}{2} \]

(approximately). Now the reasoning used here is applicable in the general case, when the $y_j$'s only satisfy (11.23) and (11.24), as follows. Neglecting error terms, and writing $\Psi_j$ for the characteristic function of $x_j$,

\begin{align*}
\Psi_j(\lambda) = E\{e^{i\lambda x_j}\} &= E\{e^{i\lambda x_{j-1}} E\{e^{i\lambda y_j} \mid \mathcal{F}_{j-1}\}\} \\
&= E\Big\{e^{i\lambda x_{j-1}} \Big[1 - \frac{\sigma_j^2 \lambda^2}{2}\Big]\Big\} \\
&= \Psi_{j-1}(\lambda) \Big[1 - \frac{\sigma_j^2 \lambda^2}{2}\Big], \qquad j > 1,
\end{align*}

so that

\[ \log \Psi_j(\lambda) - \log \Psi_{j-1}(\lambda) = -\frac{\sigma_j^2 \lambda^2}{2} \]

approximately. Adding these equations, we get

\[ \log \Psi_n(\lambda) = -\frac{s_n^2 \lambda^2}{2} \]




approximately, as in the case of sums of mutually independent $y_j$'s. Thus it is clear that the central limit theorem is applicable to martingales much as it is to sums of mutually independent random variables. This fact was first observed by Lévy. We shall not go into details in the discrete parameter case, but we shall need one continuous parameter result in this direction. We shall discuss real martingales $\{x_t, \mathcal{F}_t, a \le t \le b\}$ with $E\{x_t^2\} < \infty$. The $x_t - x_a$ process is then also a martingale, with respect to the same fields, so the $(x_t - x_a)^2$ process is a semi-martingale with respect to these fields. Hence the function $F$ defined by

\[ F(t) = E\{(x_t - x_a)^2\} \]

is monotone non-decreasing, and as a matter of fact it is trivial to verify that the $x_t$ process has orthogonal increments, so that

\[ F(t) - F(s) = E\{(x_t - x_s)^2\}, \qquad s < t. \]

The fixed points of discontinuity of the $x_t$ process are the points of discontinuity of $F$. In analogy with (11.24) we impose the condition

\[ E\{(x_t - x_s)^2 \mid \mathcal{F}_s\} = F(t) - F(s), \tag{11.24'} \]

with probability 1, for each pair $s$, $t$ with $s < t$. This is equivalent to the condition that the process

\[ \{x_t^2 - F(t), \mathcal{F}_t, a \le t \le b\} \]

be a martingale. In particular, if $F$ is a continuous function, that is, if the $x_t$ process has no fixed point of discontinuity, the change of parameter from $t'$ to $t$, where $t = F(t')$, reduces the situation to the special case in which $F(t) = t - a$, and this is the case we now treat.

THEOREM 11.9 Let $\{x_t, \mathcal{F}_t, a \le t \le b\}$ be a real martingale, and suppose that almost all sample functions of the process are continuous. Suppose that

\[ E\{x_t^2\} < \infty, \qquad a \le t \le b, \]

and that, for each pair $s$, $t$ with $s < t$,

\[ E\{(x_t - x_s)^2 \mid \mathcal{F}_s\} = t - s \tag{11.25} \]

with probability 1, that is, we suppose that $\{x_t^2 - t, \mathcal{F}_t, a \le t \le b\}$ is a martingale. Then it follows that the $x_t$ process has independent increments, and is in fact a Brownian motion process.

We shall see that the main point of the proof is that $x_b - x_a$ is Gaussian. This character of the distribution is intuitively clear, because, for large $n$, $x_b - x_a$ can be expressed as a sum of small random variables

\[ x_b - x_a = \sum_{j=1}^{n} y_j, \qquad y_j = x_{a + cj/n} - x_{a + c(j-1)/n}, \qquad c = b - a. \]

The $y_j$'s satisfy (11.24) with $\sigma_j^2 = c/n$, so that one naturally expects the central limit theorem to be applicable to give the desired result. The following formal proof of this theorem exhibits the simplifications that can be effected by a proper use of Theorem 11.8. To simplify the notation we take $a = 0$, $b = 1$. Moreover, we shall suppose, as we have shown we can, that each $\mathcal{F}_t$ contains all $\omega$ sets of probability 0. Replacing $x_t$ by $x_t - x_0$ if necessary, we can suppose that $x_0 = 0$. Let $\varepsilon$ be a positive number, let $n$ be a positive integer, and let $\tau(n, \varepsilon, \omega)$ be the first value of $t$ for which

\[ \operatorname{Max}_{s, s' \le t,\ |s - s'| \le 1/n} |x_s(\omega) - x_{s'}(\omega)| = \varepsilon, \]

or 1 if there is no such value of $t$. Here we ignore the discontinuous sample functions. Then $0 < \tau \le 1$ and the condition $\tau(\omega) > s$ involves restrictions on continuous sample functions only at parameter values $\le s$, so that $\{\tau(\omega) > s\} \in \mathcal{F}_s$. In fact, considering only continuous sample functions, if $0 < s < 1$, and if $r_1$, $r_2$ are restricted to be rational,

\[ \{\tau(\omega) > s\} = \bigcup_{m=1}^{\infty} \ \bigcap_{\substack{0 \le r_1, r_2 \le s \\ |r_1 - r_2| \le 1/n}} \{|x_{r_1}(\omega) - x_{r_2}(\omega)| \le \varepsilon - 1/m\} \in \mathcal{F}_s. \]

Then $\{\tau(\omega) \le s\} \in \mathcal{F}_s$ and we use $\tau$ to define a system of optional stopping, so that $\tilde{x}_t$ is given by

\[ \tilde{x}_t(\omega) = \begin{cases} x_t(\omega), & t \le \tau(\omega), \\ x_{\tau(\omega)}(\omega), & \tau(\omega) < t. \end{cases} \]


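Since both the $x_t$ and $x_t^2 - t$ processes are martingales, Theorem 11.8 (i), (iii), states that the processes

\[ \{\tilde{x}_t, \mathcal{F}_t, 0 \le t \le 1\}, \qquad \{\tilde{x}_t^2 - \operatorname{Min}[t, \tau], \mathcal{F}_t, 0 \le t \le 1\} \]

are martingales, so that

\[ \int_\Lambda \{\tilde{x}_s^2 - \operatorname{Min}[s, \tau]\} \, dP = \int_\Lambda \{\tilde{x}_t^2 - \operatorname{Min}[t, \tau]\} \, dP, \qquad \Lambda \in \mathcal{F}_s, \quad s < t. \]

Then

\begin{align*}
\int_\Lambda E\{\tilde{x}_t^2 - \tilde{x}_s^2 \mid \mathcal{F}_s\} \, dP &= \int_\Lambda (\tilde{x}_t^2 - \tilde{x}_s^2) \, dP \\
&= \int_\Lambda (\operatorname{Min}[t, \tau] - \operatorname{Min}[s, \tau]) \, dP \\
&\le \int_\Lambda (t - s) \, dP.
\end{align*}

The integrand on the left is measurable with respect to the field $\mathcal{F}_s$, and this inequality holds for every $\Lambda \in \mathcal{F}_s$. Hence

\[ E\{\tilde{x}_t^2 - \tilde{x}_s^2 \mid \mathcal{F}_s\} \le t - s \tag{11.26} \]

with probability 1. The $\tilde{x}_t$ process thus has almost the same three basic properties as the $x_t$ process: the process $\{\tilde{x}_t, \mathcal{F}_t, 0 \le t \le 1\}$ is a martingale; the $\tilde{x}_t$ process sample functions are almost all continuous; the equality (11.25) is weakened to (11.26) (which is equivalent to the statement that the process $\{\tilde{x}_t^2 - t, \mathcal{F}_t, 0 \le t \le 1\}$ is a lower semi-martingale). Moreover, the $\tilde{x}_t$ process is simple in that small $t$ increments are uniformly bounded. In fact, if

\[ y_j = \tilde{x}_{j/n} - \tilde{x}_{(j-1)/n}, \qquad j = 1, \cdots, n, \]

then

\[ |y_j| \le \varepsilon. \tag{11.27} \]

The martingale property of the $\tilde{x}_t$ process and (11.26) imply

\begin{align*}
& E\{y_1\} = 0, \qquad E\{y_j \mid y_1, \cdots, y_{j-1}\} = 0, \\
& \sigma_1^2 = E\{y_1^2\} \le \frac{1}{n}, \qquad \sigma_j^2 = E\{y_j^2 \mid y_1, \cdots, y_{j-1}\} \le \frac{1}{n} \tag{11.28}
\end{align*}

with probability 1. We shall prove that $\tilde{x}_1$ is nearly normally distributed, using (11.27) and (11.28). We shall choose $n = n(\varepsilon)$ as described below, so that $n \to \infty$ when $\varepsilon \to 0$.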
ESI 
Efe} = Efeti-0/n Efe” EA ERTS Yj) 


A R 
= Feit [1 fe. o? F a + oc} 


ve pfen- azara) 


where $o(1)$ here and in the following represents any expression which goes to 0 with $\varepsilon$, uniformly in any other variables involved as long as $\lambda$ is restricted to a finite interval, whereas $O(1)$ represents any expression which remains bounded under the same conditions. It follows that, if
J21, r A 
[fetn} — pfen 2) 


-penake y) 
< Efez ier) tern _ 1)} 


<0 e{(*— op) +°) 


<o0n |- eua] + ©. 




Then 


| Efeiinjean”™ — Efe hea 


1 
<0(1) [: = Etys)| Be 
Adding these inequalities, we find 
| Efe*Je"? — 1] < ODU- > Ely) + ol) 
j=1 


o(1) 


SE) peal 
n Ley 


= O(1)[1 — Ef} + o(1), 
so that

\[ |E\{e^{i\lambda \tilde{x}_1}\} - e^{-\lambda^2/2}| \le O(1)[1 - E\{\tilde{x}_1^2\}] + o(1). \]

Now, for a given $\varepsilon$, $P\{\tilde{x}_1(\omega) = x_1(\omega)\}$ can be made arbitrarily near 1 by choosing $n$ large. Hence the characteristic functions of $\tilde{x}_1$ and $x_1$ can be made arbitrarily close to each other, uniformly in any given finite interval, by choosing $n$ large. That is, $n = n(\varepsilon)$ can be chosen so large that

\[ |E\{e^{i\lambda x_1}\} - e^{-\lambda^2/2}| \le O(1)[1 - E\{\tilde{x}_1^2\}] + o(1). \]

Moreover, since $\sigma_j^2 \le 1/n$, we know that $E\{\tilde{x}_1^2\} \le 1$, so that

\[ 0 \le 1 - E\{\tilde{x}_1^2\} = E\{x_1^2 - \tilde{x}_1^2\} = \int_{\{\tau(\omega) < 1\}} (x_1^2 - \tilde{x}_1^2) \, dP \le \int_{\{\tau(\omega) < 1\}} x_1^2 \, dP. \]

Now we have chosen $n$ large to make $P\{\tau(\omega) = 1\}$ nearly 1 so that $P\{x_1(\omega) = \tilde{x}_1(\omega)\}$ will be near 1. Hence the last integral in the above inequality is $o(1)$, and this proves that

\[ E\{e^{i\lambda x_1}\} = e^{-\lambda^2/2} + o(1). \]

Since only the last term $o(1)$ can depend on $\varepsilon$, this term must actually be 0, so that $x_1$ must have a Gaussian distribution with expectation 0 and variance 1. A trivial modification of this discussion shows that, if


OS sen ey aig 
k 


k Si 
‘2 Ilt- tiy-) i EAO 24-0) 


Efe y= fe? 


ee tra), 


Then 
k 
iS 


E Attya), Li 
wid TT unit 
j=l 


that is, the $x_t$ family is Gaussian and has independent increments, as stated in the theorem.




12. Applications of martingale theory to sample function continuity of various types of processes


(a) Application to Markov processes Let $\{x_t, a \le t \le b\}$ be a Markov process with a specified transition probability distribution function. That is, we suppose that there is a given function $p$ of $\xi$, $s$, $\eta$, $t$, defining a Baire function of $\xi$ for fixed $s$, $\eta$, $t$, and defining a distribution function in $\eta$ for fixed $\xi$, $s$, $t$. For each $\eta$, $s$, $t$, with $s < t$, it is supposed that

\[ p(x_s, s; \eta, t) = P\{x_t(\omega) \le \eta \mid x_s\}, \]

with probability 1, and it is supposed that the Chapman-Kolmogorov equation is satisfied identically in $\xi$, $s$, $\eta$, $t$ for $s < t$,

\[ p(\xi, s; \eta, t) = \int_{-\infty}^{\infty} p(\zeta, u; \eta, t) \, d_\zeta p(\xi, s; \zeta, u), \qquad s < u < t. \]

Then the stochastic process $\{\tilde{x}_t, a \le t < b\}$ defined by

\[ \tilde{x}_t = p(x_t, t; \eta, b) = P\{x_b(\omega) \le \eta \mid x_s, s \le t\} \]

is a martingale, since as $t$ increases more and more conditions are introduced in the conditional probability on the right. The known continuity properties of martingale sample functions imply corresponding properties of the $x_t$ process sample functions. For example, if the Markov process is a chain with stationary transition probabilities (see VI §1), we define $\tilde{x}_t$ (modifying the above definition slightly) as

\[ \tilde{x}_t = p_{x_t j}(b - t), \]

where

\[ p_{ij}(t) = P\{x_t(\omega) = j \mid x_0(\omega) = i\}, \]

and the continuity properties of the $\tilde{x}_t$ process sample functions can be used to derive the fact (see VI, Theorem 1.4) that the sample functions of a separable chain with a finite number of states are almost all step functions.
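The martingale property of $p_{x_t j}(b - t)$ is just the Chapman-Kolmogorov equation, and this can be verified exactly for a small chain. The sketch below assumes, for illustration only, a discrete (integer) time parameter and a 3-state transition matrix chosen arbitrarily; the helper names are hypothetical.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, steps):
    """p raised to a non-negative integer power (the `steps`-step transition matrix)."""
    n = len(p)
    result = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result


def martingale_defect(p, b, j):
    """Largest |E{x~_{t+1} | x_t = i} - x~_t| over states i and times t < b,

    where x~_t = p_{x_t j}(b - t) is the chance of being in state j at time b.
    """
    worst = Fraction(0)
    for t in range(b):
        now = mat_pow(p, b - t)          # p_{ij}(b - t)
        later = mat_pow(p, b - t - 1)    # p_{kj}(b - t - 1)
        for i in range(len(p)):
            cond_exp = sum(p[i][k] * later[k][j] for k in range(len(p)))
            worst = max(worst, abs(cond_exp - now[i][j]))
    return worst


# Arbitrary illustrative transition matrix (each row sums to 1).
P = [[Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)],
     [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)],
     [Fraction(0), Fraction(2, 3), Fraction(1, 3)]]
print(martingale_defect(P, b=5, j=2))   # 0: the martingale equality holds exactly
```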

(b) Application to processes with independent increments Let $\{x_t, a \le t \le b\}$ be a real separable process with independent increments. Suppose that for each $t \in [a, b]$ the limits

\[ \lim_{s \uparrow t} x_s = x_{t-}, \qquad \lim_{s \downarrow t} x_s = x_{t+} \]

exist and are finite with probability 1. It is proved in VIII, Theorem 6.3, that then almost all sample functions of the process are bounded. (Theorem 6.3 assumes that the $x_t$ process is centered, which besides the above conditions imposes the condition that a difference $x_t - x_{t-}$ or $x_{t+} - x_t$ is not identically constant with probability 1 unless the constant

$12 APPLICATIONS TO SAMPLE FUNCTION CONTINUITY 389 


is 0, but this condition is not used in the proof.) We now apply martingale theory to the study of the continuity properties of the $x_t$ process, and we prove Lévy's result (VIII, Theorem 7.2) that almost all sample functions of the process are continuous except for discontinuities at which both left- and right-hand finite limits exist. To prove this let $\Phi_t$ be the characteristic function of $x_t - x_a$,

\[ \Phi_t(\mu) = E\{e^{i\mu(x_t - x_a)}\}, \]

and define

\[ \tilde{x}_t = \frac{e^{i\mu(x_t - x_a)}}{\Phi_t(\mu)}. \]

This definition assumes that $\mu$ is chosen so that $\Phi_t(\mu) \ne 0$. If $\delta > 0$ is chosen so small that

\[ \Phi_b(\mu) \ne 0, \qquad |\mu| \le \delta, \]

the equation

\[ \Phi_b(\mu) = \Phi_t(\mu) E\{e^{i\mu(x_b - x_t)}\} \]

shows that

\[ \Phi_t(\mu) \ne 0, \qquad a \le t \le b, \quad |\mu| \le \delta. \]

Thus, for each $\mu$ with $|\mu| \le \delta$, there is an $\tilde{x}_t$ process. Let $\mathcal{F}_t$ be the Borel field of the $\omega$ sets determined by conditions on $x_s - x_a$ for $s \le t$. Then the process $\{\tilde{x}_t, \mathcal{F}_t, a \le t \le b\}$ is a martingale, because, if $s < t$,

\[ E\{\tilde{x}_t \mid \mathcal{F}_s\} = \frac{e^{i\mu(x_s - x_a)}}{\Phi_s(\mu)} = \tilde{x}_s \]

with probability 1. In the following we shall choose a set $R$ in the interval $[a, b]$. A function $g$ defined on $[a, b]$ will be said to have the property $C_R$ if it coincides on $R$ with a function defined on $[a, b]$ which has left- and right-hand limits at all points of $[a, b]$. In other words, $g$ has the property $C_R$ if it can be redefined at the points of $[a, b] - R$ in such a way that the resulting function has left- and right-hand limits at all points of $[a, b]$. Let $R$ be any denumerable subset of $[a, b]$. Then, even though the $\tilde{x}_t$ process (with $\mu$ fixed) is not necessarily a separable process, we have seen that Theorem 11.5 is applicable to give the result that almost all sample functions of the $\tilde{x}_t$ process have the property $C_R$. Now by our hypotheses $\Phi_t(\mu)$ has for each $\mu$ left- and right-hand limits in $t$ at all points of $[a, b]$. Hence for each $\mu$ almost every sample function of the $x_t$ process has the property that the $t$ function

\[ e^{i\mu[x_t(\omega) - x_a(\omega)]} \]

has the property $C_R$. Then almost every sample function of the $x_t$ process has the property that simultaneously for all rational $\mu$ in the interval $[-\delta, \delta]$ the corresponding $t$ function above, and hence its imaginary part, has the property $C_R$. Now if a real function $f(\cdot)$ is bounded, and if $|\mu f| < \pi/2$, then $\sin \mu f$ has the property $C_R$ if and only if $f$ has this property. It follows that almost all sample functions of the $x_t$ process have this property. By hypothesis, the $x_t$ process is separable. Let $R$ be the parameter set involved in the definition of separability. Then the fact that almost all $x_t$ process sample functions have the property $C_R$ means that almost all have finite left- and right-hand limits at all points of $[a, b]$, as was to be proved.

(c) Application to Gaussian processes It is instructive to apply martingale theory to the study of the conditional expectations involved in Gaussian processes. Suppose first (discrete parameter case) that $\{x_n, n \ge 0\}$ is a Gaussian process. Then $E\{x_0 \mid x_1, \cdots, x_n\}$ is a linear combination of $x_1, \cdots, x_n$ and, since (Theorem 4.3, Corollary 1)

\[ E\{x_0 \mid x_1, x_2, \cdots\} = \lim_{n \to \infty} E\{x_0 \mid x_1, \cdots, x_n\} \]

with probability 1, the conditional expectation on the left, as a limit of Gaussian variables, is itself Gaussian. More generally, it is clear that any conditional expectations of $x_n$'s, for finitely or infinitely many given $x_j$'s, have a joint Gaussian distribution since these conditional expectations are the limits of linear combinations of $x_j$'s. (The Gaussian distributions may of course be degenerate.) In the general case if $\{x_t, t \in T\}$ is a Gaussian process, consider a conditional expectation of the form

\[ E\{x_t \mid x_s, s \in T_1\}. \]

There is a finite or enumerable sequence $\{s_n\}$ in $T_1$ such that

\[ E\{x_t \mid x_s, s \in T_1\} = E\{x_t \mid x_{s_n}, n \ge 1\} \]

with probability 1 (I, Theorem 8.2), so that according to the discrete parameter analysis just made the conditional expectation under consideration must be a Gaussian variable, and more generally any set of conditional expectations of this form is itself a Gaussian process. As an example, suppose that $\tilde{x}_t$ is defined by

#, = Efe, | x,, s < t}: 


Then the $\tilde{x}_t$ family is a Gaussian martingale. (It is easily seen that a Gaussian martingale which has no fixed points of discontinuity must be a Brownian motion aside from an additive constant and a change of the time parameter, because the family has uncorrelated increments.) We
observe that %, can be considered a prediction of «,, knowing the “past” 
of x, up to time ż (see XII). 


CHAPTER VIII


Processes with 
Independent Increments 


1. General remarks 


Processes with independent increments were defined in II §9. The integral parameter case has already been discussed; in fact we have remarked in II §9 that, if the process with random variables $x_1, x_2, \cdots$ has independent increments, $x_n$ is the $n$th partial sum of a series of mutually independent random variables, and such sums have been discussed in III. In practice the nomenclature "independent increments" is ordinarily used only in the continuous parameter case. The nomenclatures "differential process," "additive process," and "integral with independent random elements" have also been used. The latter nomenclature is suggested by the formal expression

\[ x_b - x_a = \int_a^b dx_t \]

in which the differential elements are mutually independent.

Processes with independent increments have wide sense versions, processes with uncorrelated or orthogonal increments. The latter are treated in IV (discrete parameter case) and IX (continuous parameter case).

It will be convenient in several of the following sections to use the notation $x(t)$ rather than $x_t$ for the general random variable of the process under consideration.

If $\{x_t, t \in T\}$ is a process with independent increments, and if $f$ is a function defined on $T$, then $\{x_t - f(t), t \in T\}$ is a process which also has independent increments. It will be convenient below to replace $x_t$ by $x_t - f(t)$, where $f$ is chosen to obtain simple continuity properties for the new sample functions. Moreover, if $T$ has a minimum value $a$, it is usually convenient to replace $x_t$ by $x_t - x_a$, to obtain a new process, also with independent increments, whose sample functions all vanish at $a$.



2. Brownian movement process 


This process was defined in II §9. It is a real process, $\{x(t), t \in T\}$,
where $T$ is usually taken to be an interval, in fact usually either $(-\infty, \infty)$
or $[0, \infty)$, with independent Gaussian increments satisfying

(2.1)    $E\{x(t) - x(s)\} = 0, \qquad E\{[x(t) - x(s)]^2\} = \sigma^2 |t - s|,$

where $\sigma$ is a positive constant. The process is of great importance
because of its central role in the theory of stationary Gaussian processes 
(see X and XI) and because of its numerous physical applications. 

The inequality 


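(2.2)    $\int_\lambda^\infty e^{-\xi^2/2}\, d\xi < \frac{1}{\lambda} \int_\lambda^\infty \xi e^{-\xi^2/2}\, d\xi = \frac{e^{-\lambda^2/2}}{\lambda} \qquad (\lambda > 0)$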


will be useful below. 
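
As a numerical sanity check (not part of the original argument), the inequality (2.2), which states that the integral of $e^{-\xi^2/2}$ over $[\lambda, \infty)$ is less than $e^{-\lambda^2/2}/\lambda$ for $\lambda > 0$, can be evaluated with the complementary error function. The function names below are illustrative choices, and the sketch assumes only the Python standard library.

```python
import math


def gaussian_tail_integral(lam: float) -> float:
    """Integral of exp(-xi^2/2) over [lam, infinity).

    Uses the identity  integral = sqrt(pi/2) * erfc(lam / sqrt(2)).
    """
    return math.sqrt(math.pi / 2.0) * math.erfc(lam / math.sqrt(2.0))


def tail_upper_bound(lam: float) -> float:
    """Right side of (2.2): exp(-lam^2/2) / lam, meaningful for lam > 0."""
    return math.exp(-lam * lam / 2.0) / lam


if __name__ == "__main__":
    for lam in (0.5, 1.0, 2.0, 4.0):
        print(lam, gaussian_tail_integral(lam), tail_upper_bound(lam))
```

The bound is crude for small $\lambda$ and becomes relatively sharp as $\lambda$ grows, which is the regime used in the proofs of this section.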
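THEOREM 2.1 (Separable Brownian movement process)

(2.3)    $P\{\operatorname{L.U.B.}_{0 \le t \le T} [x(t,\omega) - x(0,\omega)] \ge \lambda\} = 2P\{x(T,\omega) - x(0,\omega) \ge \lambda\} \le \frac{\sigma}{\lambda} \sqrt{\frac{2T}{\pi}}\, e^{-\lambda^2/(2\sigma^2 T)}$.
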


Suppose $0 = t_0 < \cdots < t_n = T$. Then, according to III, Theorem 2.2,

(2.4)    $P\{\max_{j \le n} [x(t_j,\omega) - x(0,\omega)] \ge \lambda\} \le 2P\{x(T,\omega) - x(0,\omega) \ge \lambda\}$.

Now the left side of (2.3) is obtained by choosing a denser and denser
sequence $t_1, \cdots, t_n$ and going to the limit (II, Theorem 2.2). Hence

(2.3′)    $P\{\operatorname{L.U.B.}_{0 \le t \le T} [x(t,\omega) - x(0,\omega)] \ge \lambda\} \le 2P\{x(T,\omega) - x(0,\omega) \ge \lambda\}$.


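To prove the reverse inequality set $t_j = jT/n$ and apply the other half of
III, Theorem 2.2, obtaining, for each $\varepsilon > 0$,

    $P\{\operatorname{L.U.B.}_{0 \le t \le T} [x(t,\omega) - x(0,\omega)] \ge \lambda\}$
        $\ge P\{\max_{j \le n} [x(t_j,\omega) - x(0,\omega)] \ge \lambda\}$
        $\ge 2P\{x(T,\omega) - x(0,\omega) \ge \lambda + 2\varepsilon\} - 2nP\{x(T/n,\omega) - x(0,\omega) \ge \varepsilon\}$
        $\ge 2P\{x(T,\omega) - x(0,\omega) \ge \lambda + 2\varepsilon\} - \frac{2n}{\sqrt{2\pi}} \frac{\sigma\sqrt{T}}{\varepsilon\sqrt{n}}\, e^{-\varepsilon^2 n/(2\sigma^2 T)}$
        $\to 2P\{x(T,\omega) - x(0,\omega) \ge \lambda\} \qquad (n \to \infty,\ \varepsilon \to 0)$.
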




THEOREM 2.2 Almost all sample functions of a separable Brownian 
movement process are continuous. 

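Suppose for definiteness that the range of the parameter is the interval
$[0, \infty)$. We prove that

(2.5)    $|x(t,\omega) - x(j/N,\omega)| < N^{-1/4}$   if   $|t - j/N| \le 1/N$,  $1 \le j \le N^2$,

for sufficiently large $N$, except for an $\omega$ set of probability 0. Using
Theorem 2.1,

    $P\{\operatorname{L.U.B.}_{|t - j/N| \le 1/N,\ 1 \le j \le N^2} |x(t,\omega) - x(j/N,\omega)| \ge N^{-1/4}\}$
        $\le N^2 P\{\operatorname{L.U.B.}_{|t - 1/N| \le 1/N} |x(t,\omega) - x(1/N,\omega)| \ge N^{-1/4}\}$
        $\le \frac{2\sigma\sqrt{2}}{\sqrt{\pi}}\, N^{7/4} e^{-N^{1/2}/(2\sigma^2)}$.
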


The last term is the Nth term of a convergent series; hence (Borel-Cantelli
lemma) the event whose probability is being measured happens
only a finite number of times, with probability 1; that is, the indicated
L.U.B. is $< N^{-1/4}$ for large $N$, with probability 1. Thus (2.5) is true,
and (2.5) states (with specific estimates of the $\varepsilon$'s and $\delta$'s involved) that
almost all sample functions are uniformly continuous in every finite $t$
interval.

The proof of continuity of the sample functions was made dependent
on the evaluation of $P\{\operatorname{L.U.B.}_{0 \le t \le T} [x(t,\omega) - x(0,\omega)] \ge \lambda\}$ in Theorem 2.1.
Actually, all that was needed was an upper bound for this probability,
that is, only half of Theorem 2.1 was needed, (2.3′). Suppose then that
instead of (2.3) only (2.3’) had been proved. The proof of Theorem 2.2 
would need no change. The exact evaluation (2.3) and similar exact 
evaluations are easily made, once sample function continuity has been 
proved, using what is known as the reflection principle of Desiré André. 
To illustrate the method, (2.3) will be derived, assuming only sample 
function continuity. 

Obviously 


(2.6)    $P\{\max_{0 \le t \le T} [x(t,\omega) - x(0,\omega)] \ge \lambda,\ x(T,\omega) - x(0,\omega) \ge \lambda\} = P\{x(T,\omega) - x(0,\omega) \ge \lambda\}$,




where we can write Max instead of L.U.B. because the sample functions 
are continuous (with probability 1). On the other hand, consider the 
continuous sample functions satisfying 


(2.7)    $\max_{0 \le t \le T} [x(t,\omega) - x(0,\omega)] \ge \lambda, \qquad x(T,\omega) - x(0,\omega) < \lambda$.

If $\tau(\omega)$ is the first value of $t$ for which $x(t,\omega) - x(0,\omega) = \lambda$, the changes
in $x(t,\omega)$ after $\tau(\omega)$ are independent of the changes before $\tau(\omega)$, and are
equally likely to be positive or negative; we shall not change the probabilities
by reflecting the sample curve for $t > \tau(\omega)$ in the line $x = \lambda$.
That is to say,


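(2.8)    $P\{\max_{0 \le t \le T} [x(t,\omega) - x(0,\omega)] \ge \lambda,\ x(T,\omega) - x(0,\omega) < \lambda\}$
             $= P\{\max_{0 \le t \le T} [x(t,\omega) - x(0,\omega)] \ge \lambda,\ x(T,\omega) - x(0,\omega) > \lambda\}$
             $= P\{x(T,\omega) - x(0,\omega) > \lambda\}$.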


Adding (2.6) to (2.8), we obtain (2.3). This proof is, of course, not 
complete without further elaboration of the italicized statement. The 
point is, however, that the parallel approximate analysis, for $t$ running
through only a discrete set of points $T/n, 2T/n, \cdots, T$ (which we have
carried through) gives in the limit ($n \to \infty$) the exact analysis, and, if
desired, the italicized statement can be considered an abbreviation of this 
limit reasoning even if the rather delicate detailed justification of the 
italicized statement is omitted. 
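
The reflection argument has an exact discrete analogue that can be checked by machine. The sketch below replaces the Brownian path by a simple symmetric random walk $S_j$ with steps $\pm 1$ (an assumption made purely for illustration; the text works with Gaussian increments) and verifies, in exact rational arithmetic, that $P\{\max_{j \le n} S_j \ge k\} = 2P\{S_n > k\} + P\{S_n = k\}$ for integers $k \ge 1$. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_max_at_least(n: int, k: int) -> Fraction:
    """P{max_{j<=n} S_j >= k}, k >= 1, by dynamic programming over paths
    that have not yet reached level k."""
    half = Fraction(1, 2)
    alive = {0: Fraction(1)}   # position -> probability, level k not yet reached
    hit = Fraction(0)          # probability that level k has been reached
    for _ in range(n):
        nxt = {}
        for pos, p in alive.items():
            for step in (-1, 1):
                new = pos + step
                if new >= k:
                    hit += p * half
                else:
                    nxt[new] = nxt.get(new, Fraction(0)) + p * half
        alive = nxt
    return hit


def prob_endpoint(n: int, value: int) -> Fraction:
    """P{S_n = value} for the simple symmetric walk."""
    if abs(value) > n or (n + value) % 2:
        return Fraction(0)
    return Fraction(comb(n, (n + value) // 2), 2 ** n)


def reflection_formula(n: int, k: int) -> Fraction:
    """2 P{S_n > k} + P{S_n = k}, the value predicted by reflection."""
    above = sum((prob_endpoint(n, v) for v in range(k + 1, n + 1)), Fraction(0))
    return 2 * above + prob_endpoint(n, k)


if __name__ == "__main__":
    print(prob_max_at_least(10, 3), reflection_formula(10, 3))
```

In the Brownian limit the term $P\{S_n = k\}$ disappears because the endpoint has a continuous distribution, which is why adding (2.6) to (2.8) gives exactly twice the endpoint probability.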

The sample functions of a separable Brownian movement process,
although (almost all) continuous, are very irregular in nature. It will be
shown below for example that, for each fixed $t_0$,

(2.9)    $\limsup_{t \downarrow t_0} \frac{x(t,\omega) - x(t_0,\omega)}{t - t_0} = +\infty$

with probability 1. In other words, at each $t_0$ the sample functions have
infinite upper derivatives with probability 1. This means (by II, Theorem
2.5 this process is measurable) that almost all sample functions have the
property that the upper derivates are $+\infty$ for all values of $t$ except for
a set of values of $t$ of Lebesgue measure 0. The exceptional $t$ set will
vary from sample function to sample function. To prove (2.9) suppose
that $\lambda$ is positive and consider the probability

    $P\{\operatorname{L.U.B.}_{t_0 < t \le t_0 + \delta} \frac{x(t,\omega) - x(t_0,\omega)}{t - t_0} \ge \lambda\}$.




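It is sufficient to prove that the probability goes to 1 when $\delta \to 0$, and
this follows from

    $P\{\operatorname{L.U.B.}_{t_0 < t \le t_0 + \delta} \frac{x(t,\omega) - x(t_0,\omega)}{t - t_0} \ge \lambda\}$
        $\ge P\{\max_{t_0 \le t \le t_0 + \delta} [x(t,\omega) - x(t_0,\omega)] \ge \lambda\delta\}$
        $= 2P\{x(t_0 + \delta,\omega) - x(t_0,\omega) \ge \lambda\delta\}$
        $= 2P\{\frac{x(t_0 + \delta,\omega) - x(t_0,\omega)}{\sqrt{\delta}} \ge \lambda\sqrt{\delta}\} \to 1 \qquad (\delta \to 0)$,

since $[x(t_0 + \delta,\omega) - x(t_0,\omega)]/\sqrt{\delta}$ is normally distributed with expectation
0 and variance $\sigma^2$ independent of $\delta$.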
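
The last step can be made concrete numerically. Under the assumption that the increment over an interval of length $\delta$ is Gaussian with mean 0 and variance $\sigma^2\delta$, the lower bound $2P\{x(t_0+\delta) - x(t_0) \ge \lambda\delta\}$ equals $\operatorname{erfc}(\lambda\sqrt{\delta}/(\sigma\sqrt{2}))$. The following sketch (function name chosen here for illustration) tabulates this quantity as $\delta$ decreases.

```python
import math


def quotient_lower_bound(lam: float, delta: float, sigma: float = 1.0) -> float:
    """2 * P{N(0, sigma^2 * delta) >= lam * delta}.

    With z = lam * sqrt(delta) / sigma this is 2 * (1 - Phi(z)) = erfc(z / sqrt(2)).
    """
    z = lam * math.sqrt(delta) / sigma
    return math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    for delta in (1.0, 1e-2, 1e-4, 1e-8):
        print(delta, quotient_lower_bound(lam=10.0, delta=delta))
```

However large the slope $\lambda$ is taken, the bound tends to 1 as $\delta \to 0$, which is the content of (2.9).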

We can state even more striking facts about the irregular character of
the Brownian movement sample functions. Let $f$ be any fixed continuous
function of $t$, and suppose that $0 = t_0 < \cdots < t_n = T$. Then

    $\sum_{j=0}^{n-1} [f(t_{j+1}) - f(t_j)]^2 \le \max_{j \le n-1} |f(t_{j+1}) - f(t_j)| \sum_{j=0}^{n-1} |f(t_{j+1}) - f(t_j)|$.

Hence, if the function is continuous in the interval $[0, T]$, and if the sum
on the right is bounded independently of the choice of $n$ and of the $t_j$'s,
that is, if $f$ is of bounded variation (or, geometrically, if the graph has
finite length), the sum on the left will go to 0 when $\max_{j \le n-1} (t_{j+1} - t_j) \to 0$.
The following theorem, according to which this sum approaches 0 for
almost no sample function, thus shows that almost no Brownian movement
sample function has bounded variation.


THEOREM 2.3 (Brownian movement process) Let $t_0, t_1, \cdots$ be everywhere
dense in the interval $[0, T]$; and let $t_0^{(n)}, \cdots, t_n^{(n)}$ be the numbers
$t_0, \cdots, t_n$ arranged in numerical order, $t_0^{(n)} < \cdots < t_n^{(n)}$. Then

    $\lim_{n \to \infty} \sum_{j=0}^{n-1} [x(t_{j+1}^{(n)}) - x(t_j^{(n)})]^2 = \sigma^2 T$

with probability 1, and the limit relation is also true in the sense of convergence
in the mean.

In fact, let $S_n$ be the sum on the left and suppose first that $t_0$ and $t_1$ are
$0$, $T$ so that $t_0^{(n+1)}, \cdots, t_{n+1}^{(n+1)}$ is obtained by inserting $t_{n+1}$ between
two of the $t_j^{(n)}$'s. Then we shall show that the sequence $\cdots, S_2, S_1$
(note the order) constitutes a martingale. It is sufficient to show that,
for every pair of positive integers $m$, $n$,

    $E\{S_n \mid S_{n+m}, \cdots, S_{n+1}\} = S_{n+1}$

with probability 1; that is to say,

(2.10)    $E\{S_n - S_{n+1} \mid S_{n+m}, \cdots, S_{n+1}\} = 0$




with probability 1. The details of the symmetry argument used to prove
this equation will be omitted except for the case $m = 2$. (If the equation
is true for $m = 2$, it is necessarily also true for $m = 1$.) There are two
possibilities: either $t_{n+1}$ and $t_{n+2}$ are both inserted in one of the intervals
$(t_j^{(n)}, t_{j+1}^{(n)})$ or they are inserted in different intervals. The two are
treated in the same way, and we shall consider only the first. In this
case (2.10) is a consequence of the following statement: let $x_1, x_2, x_3, s$
be mutually independent random variables, of which the first three are
Gaussian, with zero means. Then

(2.11)    $E\{[(x_1 + x_2 + x_3)^2 + s] - [(x_1 + x_2)^2 + x_3^2 + s] \mid x_1^2 + x_2^2 + x_3^2 + s,\ (x_1 + x_2)^2 + x_3^2 + s\}$
              $= E\{2(x_1 + x_2)x_3 \mid x_1^2 + x_2^2 + x_3^2 + s,\ (x_1 + x_2)^2 + x_3^2 + s\} = 0$,

with probability 1. To prove this relation we observe first that (symmetry)

    $E\{2(x_1 + x_2)x_3 \mid x_1, x_2, x_3^2\} = 0$,

with probability 1. Next, since $s$ is independent of the $x_j$'s it follows that

(2.12)    $E\{2(x_1 + x_2)x_3 \mid x_1, x_2, x_3^2, s\} = 0$

with probability 1, and (2.11) is deduced by taking the conditional expectation
of both sides of this equation with respect to the conditioning
variables in (2.11). [The point is that the conditions in (2.11) are less
restrictive than those in (2.12).]

Since the sequence $\cdots, S_2, S_1$ is a martingale, $\lim_{n \to \infty} S_n$ exists with
probability 1 (VII, Theorem 4.2). To show that the limit is $\sigma^2 T$ we show
that the l.i.m. is $\sigma^2 T$,

    $E\{(S_n - \sigma^2 T)^2\} = 2\sigma^4 \sum_{j=0}^{n-1} (t_{j+1}^{(n)} - t_j^{(n)})^2 \le 2\sigma^4 T \max_j (t_{j+1}^{(n)} - t_j^{(n)}) \to 0 \qquad (n \to \infty)$.

This finishes the proof in the special case $t_0 = 0$, $t_1 = T$. In the general
case, if we define $S_n'$ by

    $S_n' = [x(t_0^{(n)}) - x(0)]^2 + S_n + [x(T) - x(t_n^{(n)})]^2$,

it follows from what we have just proved that

    $\lim_{n \to \infty} S_n' = \sigma^2 T$

with probability 1. The theorem is then true in the general case, because

    $\lim_{n \to \infty} x(t_0^{(n)}) = x(0), \qquad \lim_{n \to \infty} x(t_n^{(n)}) = x(T)$

with probability 1.
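
Theorem 2.3 is easy to observe in simulation. The sketch below is an illustration under simplifying assumptions, not the construction used in the proof: it uses an equally spaced partition of $[0, T]$ rather than a general dense sequence, and it draws the increments directly as independent Gaussian variables with variance $\sigma^2 T/n$. The function name and the seed are arbitrary.

```python
import random


def sum_of_squared_increments(T: float, sigma: float, n: int,
                              rng: random.Random) -> float:
    """S_n for an equally spaced partition of [0, T] into n intervals.

    Each increment x(t_{j+1}) - x(t_j) is drawn as N(0, sigma^2 * T / n).
    """
    sd = sigma * (T / n) ** 0.5
    return sum(rng.gauss(0.0, sd) ** 2 for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(2024)
    for n in (10, 100, 1000, 10000):
        # The values should settle near sigma^2 * T = 2.25 * 2 = 4.5.
        print(n, sum_of_squared_increments(T=2.0, sigma=1.5, n=n, rng=rng))
```

For this partition the mean square error is $E\{(S_n - \sigma^2 T)^2\} = 2\sigma^4 T^2/n$, in agreement with the estimate used in the proof.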




3. Physical applications of the Brownian movement process 
Let x(t) be the x coordinate at time t of a particle in some medium. 
Much of the following analysis is applicable to cases as different as: 
(a) The particle is a molecule of a liquid or gas. 
(b) The particle is of microscopic size, in a fluid, say a colloidal particle. 
(c) The particle is a star; the medium is the stellar universe. 


The English biologist Brown observed in 1826 that a particle in case (b) 
makes irregular apparently spontaneous movements, now known to be 
caused by the impacts on it of the molecules of the medium. This motion 
is called Brownian motion, and the Brownian motion (or “movement”)
process is used to analyze this motion. The essential condition of the 
following analysis is that there are only negligible bonds between the 
particle and those of the surrounding medium, except at the times of 
impact. The analysis is thus applicable, for example, to the motion of a 
molecule of a gas at low pressure. An “impact” is the proximity to a 
particle under analysis of the force field of a particle of the medium; it 
is unnecessary to identify an impact with something like a collision of 
two billiard balls. 

The impacts on a Brownian particle follow each other in an irregular
fashion, at a high rate (in the relevant time scale), and the displacement
component $x(t + s) - x(t)$ is thus the sum of a large number of small
displacement components, if $s$ is large compared to the time between
impacts. It is natural to consider the function $x(\cdot)$ as a sample function
of a stochastic process, and the problem is to describe this continuous
parameter process. It is supposed that the medium is in macroscopic
equilibrium, and this hypothesis is reasonably translated to mean that the
distribution of $x(t + s) - x(t)$ is symmetric, and does not depend on $t$.
As a first approximation it is supposed that the random variable
$x(t + s) - x(t)$ is for $s > 0$ independent of the motion at times well before
$t$, and this leads to the still stronger hypothesis that the $x(t)$ process has
independent increments. The central limit theorem now suggests that
$x(t + s) - x(t)$, as a sum of nearly independent small random variables,
has a Gaussian distribution. If these hypotheses are all accepted, the
$x(t)$ process must be what we have called the Brownian motion process.
In the case of a particle in a fluid the parameter $\sigma^2$ of the process can be
identified with $2D$, where $D$ is the diffusion constant of the fluid, so that

    $E\{[x(t) - x(0)]^2\} = 2Dt$.
This formula is due to Einstein. As noted by him, and as the above 
“derivation” makes clear, the formula can only be expected to be a crude 
approximation when $t$ is of the order of magnitude of the time between
molecular impacts. In any event, the reasoning used here, undisturbed 




by any specific particle dynamics, can hardly be considered more than 
suggestive. Indeed a more sophisticated mathematical approach might
have concluded only that the $x(t)$ process should have stationary independent
increments, and that the distribution of any increment
$x(t + s) - x(t)$ should be infinitely divisible (cf. III, §4). The Brownian
movement process is not the only one satisfying these conditions (cf. §7 
of this chapter). However, the others could be excluded by examining 
the exact hypotheses of the relevant form of the central limit theorem and 
then asserting flatly that these seem to be reasonably well verified experi- 
mentally, or by noting that (cf. Theorem 7.1) the others do not have 
continuous sample functions, and discontinuous trajectories are repugnant 
to practical sensibilities. However much or little such a discussion is 
convincing, the fact is that it at least suggests the applicability of a 
separable Brownian movement process, and this mathematical model has 
in fact been checked empirically, or at any rate a few of its implications 
have been checked. It is satisfactory that the sample functions of the 
process are continuous; it is less satisfactory that (cf. §2) these sample 
functions are not of bounded variation, so that the trajectories have 
infinite length. Moreover, the velocities do not exist; in fact, we proved
in §2 that the upper and lower derivates of a sample function at a given
point $t = t_0$ are $+\infty$ and $-\infty$ respectively with probability 1. These
properties of the process should not be taken too seriously from a practical
point of view since they are in-the-small properties of the sample functions,
involving increments $x(t + s) - x(t)$ with $s$ small, and we have already
remarked that the fit of theory to practice cannot be expected to be good
for properties involving small increments. Closer analyses of Brownian
movements have led to a somewhat different mathematical model with
an $x(t)$ process which has sample functions with continuous derivatives.
As would be expected, $E\{[x(t + s) - x(t)]^2\} \sim s^2 E\{x'(t)^2\}$ is of the order
of $s^2$ for small $s$ in this process.


4. Poisson process 


This type of process was defined in II §9. A Poisson process is a real
process with stationary independent increments; the increments are
integral-valued, and

(4.1)    $P\{x(t,\omega) - x(s,\omega) = m\} = \frac{e^{-c(t-s)} [c(t-s)]^m}{m!}, \qquad t > s, \quad m = 0, 1, 2, \cdots$,

where $c$ is a positive constant. We shall need the expectations

(4.2)    $E\{x(t) - x(s)\} = c(t - s), \qquad E\{[x(t) - x(s) - c(t - s)]^2\} = c(t - s), \qquad t > s$.




According to (4.1), $P\{x(t,\omega) \ge x(s,\omega)\} = 1$, for fixed $s$ and $t$ with $s < t$. Then,
if $t_1, t_2, \cdots$ is any sequence of $t$ values, $x(t) - x(0)$ is monotone non-decreasing
and integral-valued, if $t$ is restricted to the $t_j$'s, with probability
1. That is, almost all sample functions (if considered defined only on the
$t_j$'s) are monotone non-decreasing, with integral-valued increments.
Now (II §2), if the process is separable, the sequence $\{t_j\}$ can be chosen
in such a way that almost every sample function $x(t)$ has the same upper
and lower bounds on every open interval as on the $t_j$'s in the interval.
Hence, excluding a collection of sample functions of probability 0, each
sample function has the following properties: it is monotone non-decreasing;
it increases only in jumps, of integral-valued magnitudes,
that is, if the sample function $f(\cdot)$ has a jump at $t_0$,

    $f(t_0-) \le f(t_0) \le f(t_0+) = f(t_0-) + n$,

where $n$ is a positive integer; $f(t) - f(0)$ is then integral-valued except
perhaps at the jump points. Moreover, we shall now show that the
amount of the jump, $n$ above, can be taken as 1, that is, the probability
of the class of sample functions which ever have a jump of magnitude
greater than 1 is 0. It is obviously sufficient to prove that the probability
of a jump of magnitude greater than 1 at some point in any given finite
interval, say $(0, T)$, is 0. This follows from the limit equation

    $P\{\max_{0 \le j < n} [x(\frac{j+1}{n}T, \omega) - x(\frac{j}{n}T, \omega)] > 1\}$
        $\le \sum_{j=0}^{n-1} P\{x(\frac{j+1}{n}T, \omega) - x(\frac{j}{n}T, \omega) > 1\}$
        $= n\left(1 - e^{-cT/n} - \frac{cT}{n} e^{-cT/n}\right) \to 0 \qquad (n \to \infty)$,

since, if a sample function has a jump of magnitude greater than 1 in the
interval $(0, T)$, the indicated maximum will be greater than 1 for every $n$.
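
The rate at which this bound vanishes can be tabulated directly. The following sketch evaluates $n(1 - e^{-cT/n} - (cT/n)e^{-cT/n})$, the sum of the probabilities that an increment over a subinterval of length $T/n$ exceeds 1; the function name is an illustrative choice, and `math.expm1` is used only to limit cancellation error for large $n$.

```python
import math


def double_jump_bound(n: int, c: float, T: float) -> float:
    """n * P{Poisson(c*T/n) variable > 1} = n * (1 - e^{-a} - a e^{-a}), a = c*T/n."""
    a = c * T / n
    # -expm1(-a) equals 1 - exp(-a) but stays accurate when a is tiny.
    return n * (-math.expm1(-a) - a * math.exp(-a))


if __name__ == "__main__":
    for n in (10, 100, 1000, 10**4, 10**6):
        print(n, double_jump_bound(n, c=2.0, T=3.0))
```

The values decrease like $(cT)^2/(2n)$, so the probability of a jump of magnitude greater than 1 in $(0, T)$ is smaller than any positive number.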

The sample functions of a separable Poisson process are not continuous
functions; in fact, the probability of continuity throughout an interval
$(t, t + T)$ is $e^{-cT}$, which goes to 0 when $T$ becomes infinite. Nevertheless
there is continuity at each value $t_0$ of $t$ with probability 1, since

    $P\{x(t_0 + \varepsilon, \omega) - x(t_0 - \varepsilon, \omega) > 0\} = 1 - e^{-2c\varepsilon} \to 0 \qquad (\varepsilon \to 0)$.

The point is that the probability of continuity simultaneously at all the
points of an interval is less than 1.
It is convenient in some applications to make a semantic transformation
here, and to describe each jump of a sample function as an event. The




number of events occurring within the interval $(s, t)$ is then $x(t-) - x(s+)$
(if we ignore completely here, as we shall in the following, all sample
functions except those which are monotone and increase only in unit
jumps). We observe that we have just shown that
$x(t-) - x(s+) = x(t) - x(s)$ with probability 1, so that in counting events in an interval, if
zero probabilities can be neglected, it makes no difference whether the
endpoints of the interval are considered or not. In the present terminology
the first equation of (4.2) states that the expected number of events
in an interval of length $l$ is $cl$. The constant $c$ is thus the (average) rate
of occurrence of the events.

If events occur in accordance with the Poisson law as described here, 
they are sometimes described as “purely random,” or in the physical 
literature sometimes simply as “random.” This distribution of events 
can be described as follows, without the terminology of stochastic 
processes:

Events are occurring in such a way that the probability that m will occur 
in the open time interval (s, t) is given by the right side of (4.1) (and this is 
then also the probability that m will occur in the closed time interval [s, t]); 
if $\mu_1, \cdots, \mu_n$ are the numbers of events which occur in $n$ intervals of time,
disjunct except possibly for endpoints, then $\mu_1, \cdots, \mu_n$ are mutually
independent random variables. Here n is an arbitrary positive integer. 


We observe that the conditional distribution of events in the interval
$(s, t)$, under the hypothesis that $m$ have occurred there, is that of $m$ points
chosen independently in the interval, each with constant probability
density $1/(t - s)$, so that the probability density that the $m$ points will be
at $\xi_1, \cdots, \xi_m$ is $m!/(t - s)^m$. In fact, the probability that $m_1, \cdots, m_n$
events occur respectively in non-overlapping intervals in $(s, t)$ of lengths
$l_1, \cdots, l_n$, with $\sum_{j=1}^n l_j = t - s$, if $m = \sum_{j=1}^n m_j$ events are known to have
occurred in $(s, t)$, is

    $\frac{\prod_{j=1}^n e^{-c l_j} (c l_j)^{m_j} / m_j!}{e^{-c(t-s)} [c(t-s)]^m / m!} = \frac{m!}{m_1! \cdots m_n!} \prod_{j=1}^n \left(\frac{l_j}{t - s}\right)^{m_j}$,

which reduces to

    $\frac{m!}{(t - s)^m} \prod_{j=1}^m d\xi_j$

if $m$ of the $m_j$'s are 1 (and the others 0) and if the corresponding $l_j$'s are
replaced by $d\xi_j$'s. Thus the Poisson distribution of events can be obtained




in a given finite time interval of length $l$ by first determining the number
$\mu$ of events to occur in the interval,

    $P\{\mu(\omega) = m\} = \frac{e^{-cl} (cl)^m}{m!}$,

and then choosing $\mu(\omega)$ points independently in the interval each with
density of distribution $1/l$ in the interval. The Poisson distribution of
events can be approximated in an infinite time interval, say $(0, \infty)$, in the
same spirit, as follows. Choose $\mu_T$ points in the interval $(0, T)$, choosing
them independently, each with density of distribution $1/T$ in the interval.
Here $\mu_T$ is for each $T$ a constant which satisfies the equation

    $\lim_{T \to \infty} \mu_T / T = c$;

for example, $\mu_T$ may be the integer closest to $cT$. Then when $T \to \infty$
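the distribution of events approaches the Poisson distribution of events
in the sense that, if $I_1, \cdots, I_n$ are any intervals, disjunct except possibly
for endpoints, of lengths $l_1, \cdots, l_n$, respectively, and if $\mu_j(T)$ is the number
of points chosen in $I_j$, then $\mu_1(T), \cdots, \mu_n(T)$ are random variables for
which

    $\lim_{T \to \infty} P\{\mu_j(T) = m_j,\ j = 1, \cdots, n\} = \prod_{j=1}^n \frac{e^{-c l_j} (c l_j)^{m_j}}{m_j!}$.

In fact, if the interval $(0, T)$ includes the $I_j$'s and if $l = \sum_{j=1}^n l_j$, $m = \sum_{j=1}^n m_j$,
the probability on the left is

    $\left(\frac{l_1}{T}\right)^{m_1} \cdots \left(\frac{l_n}{T}\right)^{m_n} \left(1 - \frac{l}{T}\right)^{\mu_T - m} \frac{\mu_T!}{m_1! \cdots m_n! (\mu_T - m)!}$
        $\sim \left(\frac{l_1}{T}\right)^{m_1} \cdots \left(\frac{l_n}{T}\right)^{m_n} \left(1 - \frac{l}{T}\right)^{\mu_T - m} \frac{\mu_T^m}{m_1! \cdots m_n!}$
        $\to \prod_{j=1}^n \frac{e^{-c l_j} (c l_j)^{m_j}}{m_j!} \qquad (T \to \infty)$,

as was to be proved.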
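
The limit just proved can be checked numerically. In the sketch below the exact probability for $\mu_T$ uniformly placed points is the multinomial expression of the text, evaluated through logarithms to avoid overflow, and it is compared with the product of Poisson probabilities. The choice $\mu_T = $ the integer closest to $cT$ follows the text; the function names and the sample values of $c$, $l_j$, $m_j$ are assumptions made for illustration.

```python
import math


def uniform_points_probability(counts, lengths, T, mu_T):
    """P{mu_j(T) = m_j for all j} when mu_T points are placed independently
    and uniformly in (0, T); requires sum(lengths) < T."""
    m, l = sum(counts), sum(lengths)
    if m > mu_T:
        return 0.0
    # log of mu_T! / (mu_T - m)!
    log_p = math.lgamma(mu_T + 1) - math.lgamma(mu_T - m + 1)
    for m_j, l_j in zip(counts, lengths):
        log_p += m_j * math.log(l_j / T) - math.lgamma(m_j + 1)
    log_p += (mu_T - m) * math.log1p(-l / T)
    return math.exp(log_p)


def poisson_product(counts, lengths, c):
    """Product over j of e^{-c l_j} (c l_j)^{m_j} / m_j!."""
    p = 1.0
    for m_j, l_j in zip(counts, lengths):
        p *= math.exp(-c * l_j) * (c * l_j) ** m_j / math.factorial(m_j)
    return p


if __name__ == "__main__":
    c, counts, lengths = 1.5, (2, 0, 3), (1.0, 0.5, 2.0)
    for T in (10.0, 1e3, 1e6):
        print(T, uniform_points_probability(counts, lengths, T, round(c * T)),
              poisson_product(counts, lengths, c))
```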
It is important to find simple qualitative conditions under which a
distribution of events follows the Poisson law. One such set is the
following, written for the parameter interval $[0, \infty)$. Each event is
identified with a point on the time axis, that is, at most one event happens
at any moment.

(a) Only a finite number of events occur in any finite time interval. We
write $x(t)$ for the number of events occurring in the interval $0 < s \le t$.
Then $x(t)$ is a random variable for every $t > 0$; we define $x(0) = 0$.



(b) The x(t) process has independent increments. 
(c) The distribution of $x(t) - x(s)$ depends only on $t - s$, that is, the $x(t)$
process has stationary increments.


To show that a process satisfying (a), (b), and (c) is a Poisson process
let $\Phi(t)$ be the probability that no events occur in the interval $[0, t]$,

    $\Phi(t) = P\{x(t,\omega) - x(0,\omega) = 0\}, \qquad t > 0$.

The function $\Phi(t)$ does not vanish identically or an event would be certain
to occur in any time interval, no matter how small, contradicting (a). By
(b) and (c), if $s > 0$ and if $t > 0$,

(4.3)    $\Phi(s + t) = \Phi(s)\Phi(t), \qquad 0 \le \Phi(t) \le 1$.

According to (4.3), if $\Phi(t) = 0$,

    $0 = \Phi(t) = \Phi(t/2)^2 = \cdots$,

so that $\Phi(t)$ vanishes arbitrarily near $t = 0$, and therefore vanishes
identically since $\Phi$ is monotone non-increasing. Since this implication is
false, $\Phi(t)$ can never vanish. The only monotone non-vanishing solution
of (4.3) has the form

    $\Phi(t) = e^{-ct}$

for some constant $c \ge 0$. The case $c = 0$ corresponds to a degenerate
Poisson distribution where no events ever occur. We finish the proof by
proving the Poisson formula (4.1). It is no restriction to take $s = 0$.
Since the probability that an event will occur at a given time is 0 (because
$\lim_{t \to 0} \Phi(t) = 1$), we have

(4.4)    $P\{x(t,\omega) - x(0,\omega) = m\}$ = P{at least one event occurs in each of
             $m$ of the intervals $(jt/n, (j+1)t/n)$,
             $j = 0, \cdots, n-1$, and no event in
             the remaining intervals} + $q_n$,

where

    $|q_n| \le$ P{two or more events occur in at least one interval $(jt/n, (j+1)t/n)$}.

Let $\Lambda_n$ be the $\omega$ set described in the preceding line. Then, if $\omega \in \Lambda_n$ for
infinitely many values of $n$, it follows that $\omega \in \Lambda_n$ for all large values of $n$,
and corresponding to this point $\omega$ there are infinitely many events in
$(0, t)$. Thus, by (a), $\Lambda_n$ converges when $n \to \infty$ to an $\omega$ set of zero
probability. The first term on the right in (4.4) has the evaluation

    $\frac{n!}{m!(n-m)!} (1 - e^{-ct/n})^m e^{-ct(n-m)/n} \to \frac{e^{-ct} (ct)^m}{m!} \qquad (n \to \infty)$.

Thus (4.4) becomes (4.1) (with $s = 0$) when $n \to \infty$.
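
The final limit is the classical passage from the binomial to the Poisson distribution and can be watched numerically. In the sketch below each of $n$ subintervals of length $t/n$ is empty with probability $e^{-ct/n}$, as follows from $\Phi(t) = e^{-ct}$; the function names and parameter values are illustrative assumptions.

```python
import math


def occupied_intervals_probability(n: int, m: int, c: float, t: float) -> float:
    """C(n, m) (1 - e^{-ct/n})^m e^{-ct(n-m)/n}: exactly m of the n
    subintervals contain at least one event."""
    p_occupied = -math.expm1(-c * t / n)      # 1 - exp(-ct/n), computed stably
    return math.comb(n, m) * p_occupied ** m * math.exp(-c * t * (n - m) / n)


def poisson_probability(m: int, c: float, t: float) -> float:
    """Right side of (4.1) with s = 0."""
    return math.exp(-c * t) * (c * t) ** m / math.factorial(m)


if __name__ == "__main__":
    for n in (10, 100, 10**4, 10**6):
        print(n, occupied_intervals_probability(n, 3, 2.0, 1.5),
              poisson_probability(3, 2.0, 1.5))
```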




Note that, if property (c) is dropped, we no longer can write $\Phi$ explicitly,
but it must nevertheless be monotone non-increasing. If it is supposed
that the probability is 0 that an event occur at any preassigned time, $\Phi$
must be continuous. The argument just used shows then that, if $\Psi = \log \Phi$,

    $P\{x(t,\omega) - x(s,\omega) = m\} = \frac{e^{\Psi(t) - \Psi(s)} [\Psi(s) - \Psi(t)]^m}{m!}, \qquad s < t$.

In other words, the process is the Poisson process after a change of
variable on the time axis.

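Finally, another important property of the Poisson distribution of
events over the interval $[0, \infty)$ is the following: Let $s_1$ be the time to the
first event, and for $j > 1$ let $s_j$ be the time between the $(j-1)$th and $j$th
event. Then $s_1, s_2, \cdots$ are mutually independent random variables with
the common distribution function

    $P\{s_j(\omega) \le \lambda\} = 1 - e^{-c\lambda}, \qquad \lambda \ge 0$,
    $P\{s_j(\omega) \le \lambda\} = 0, \qquad \lambda < 0$.

To avoid notational complexities we only consider $s_1$ and $s_2$. In the first
place

    $P\{s_1(\omega) \le t_1\} = P\{x(t_1,\omega) - x(0,\omega) > 0\} = 1 - e^{-ct_1}, \qquad t_1 \ge 0$,

which verifies the statement for $s_1$. In the second place

    $P\{s_1(\omega) \le t_1,\ s_2(\omega) \le t_2\}$
        $\le \sum_{j=0}^{n-1} P\{x(\frac{j}{n}t_1,\omega) - x(0,\omega) = 0,\ x(\frac{j+1}{n}t_1,\omega) - x(\frac{j}{n}t_1,\omega) = 1,\ x(\frac{j+1}{n}t_1 + t_2,\omega) - x(\frac{j+1}{n}t_1,\omega) \ge 1\}$
        $+ \sum_{j=0}^{n-1} P\{x(\frac{j}{n}t_1,\omega) - x(0,\omega) = 0,\ x(\frac{j+1}{n}t_1,\omega) - x(\frac{j}{n}t_1,\omega) \ge 2\}$.

Now the second sum here is $O(1/n)$ as $n \to \infty$ since it is dominated by

    $\sum_{j=0}^{n-1} e^{-cjt_1/n} \left(1 - e^{-ct_1/n} - \frac{ct_1}{n} e^{-ct_1/n}\right) = \frac{1 - e^{-ct_1}}{1 - e^{-ct_1/n}}\, O\left(\frac{1}{n^2}\right) = O\left(\frac{1}{n}\right)$,

and the first sum is

    $\sum_{j=0}^{n-1} e^{-cjt_1/n}\, \frac{ct_1}{n} e^{-ct_1/n} (1 - e^{-ct_2}) \to (1 - e^{-ct_1})(1 - e^{-ct_2}), \qquad n \to \infty$.

Hence

(4.5)    $P\{s_1(\omega) \le t_1,\ s_2(\omega) \le t_2\} \le (1 - e^{-ct_1})(1 - e^{-ct_2})$.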




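To prove the reverse inequality we observe that the left side of (4.5) is
at least

    $\sum_{j=0}^{n-1} P\{x(\frac{j}{n}t_1,\omega) - x(0,\omega) = 0,\ x(\frac{j+1}{n}t_1,\omega) - x(\frac{j}{n}t_1,\omega) = 1,\ x(\frac{j}{n}t_1 + t_2,\omega) - x(\frac{j+1}{n}t_1,\omega) \ge 1\}$
        $= \sum_{j=0}^{n-1} e^{-cjt_1/n}\, \frac{ct_1}{n} e^{-ct_1/n} \left(1 - e^{-c(t_2 - t_1/n)}\right)$
        $\to (1 - e^{-ct_1})(1 - e^{-ct_2}), \qquad n \to \infty$.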


It is sometimes useful to consider a Poisson process with parameter
running from $-\infty$ to $\infty$ instead of merely from 0 to $\infty$. If this is done
one can define $s'$, the total time between the events that immediately
precede and follow a given time $t_0$. Then $s'$ has the distribution of $s_1 + s_2$,
so that it has density of distribution $c^2 \lambda e^{-c\lambda}$ as compared with density
$c e^{-c\lambda}$ for the $s_j$'s. In discussing the “time between events” one must be
careful to distinguish between $s'$ and $s_j$.
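
The distinction between $s'$ and $s_j$ shows up clearly in simulation: the gap that straddles a fixed time has mean $2/c$, twice the mean $1/c$ of an ordinary gap. The sketch below is an approximation under stated assumptions. It simulates the process on $[0, \infty)$ only, treats the origin as if it were an event, and takes $t_0$ large compared with $1/c$ so that the effect of the origin is negligible. Function names and the seed are arbitrary.

```python
import random


def straddling_gap(c: float, t0: float, rng: random.Random) -> float:
    """Length of the gap between consecutive events that contains t0.

    Gaps are independent exponential variables with density c*exp(-c*x);
    the origin is treated as an event (an approximation for large c*t0).
    """
    prev = 0.0
    while True:
        nxt = prev + rng.expovariate(c)
        if nxt > t0:
            return nxt - prev
        prev = nxt


def mean_straddling_gap(c: float, t0: float, samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E{s'}."""
    rng = random.Random(seed)
    return sum(straddling_gap(c, t0, rng) for _ in range(samples)) / samples


if __name__ == "__main__":
    # Expect a value near 2/c = 2.0, not 1/c = 1.0.
    print(mean_straddling_gap(c=1.0, t0=30.0, samples=5000))
```

The density $c^2\lambda e^{-c\lambda}$ has mean $2/c$, which is what the estimate approaches; long gaps are more likely to cover a preassigned time than short ones.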


5. Application of the Poisson process to molecular and stellar distributions 


Consider a set of points on a finite or infinite interval, say the stars of 
a one-dimensional universe. Let the coordinates of the points be $\cdots, x_0, x_1, \cdots$.
It is supposed that the points are uniformly distributed, and
the question is how they can move to preserve this property. In other 
words, what kind of motion is compatible with the statistical equilibrium 
of the system of particles as distinguished from the stationarity of the 
process governing the individual particles? The following two examples 
will clarify the possibilities. 

(a) Suppose that there are infinitely many particles on an infinite line,
in a Brownian motion. If $x(t)$ is the coordinate of one of these particles
at time $t$, we have seen that the $x(t)$ process is the Brownian movement
process, for which $E\{[x(t) - x(0)]^2\} = 2Dt \to \infty$ $(t \to \infty)$. Hence the
particle at any given time $t$ will, if $t$ is large, be far out with probability
near 1. In other words, the individual particles tend to diffuse outward;
the $x(t)$ process is of course not stationary. Nevertheless, it is intuitively
reasonable that a system of infinitely many particles can present a picture
of a stationary system; the particles from far out can change places with
those near the origin.

(b) Suppose that there are finitely many particles diffusing in the 
interval (a, b), being reflected whenever they reach a or b (“reflecting 
barrier”). In this case it seems reasonable that, defining x(t) as in (a), 




the $x(t)$ process tends to stationarity as $t \to \infty$, regardless of the initial
conditions, and in fact will be a stationary process if the proper initial 
conditions are imposed. In this case then, if the various particles move 
independently, the system will be stationary, regardless of the number of 
particles. 

We shall consider only the case of an infinite interval. The first thing
to do is to define what one means by a distribution of infinitely many
particles with density $c$ over the $\xi$-axis. The definition is sometimes given
in terms of a limit idea, as follows: choose $\mu_l$ particles in the interval
$(-l, l)$, choosing them independently, each with constant density of
distribution $1/(2l)$ in the interval. Then let $l \to \infty$ and $\mu_l \to \infty$ so that
$\mu_l/2l \to c$. According to §4, this is simply an indirect way of stating that
the points are distributed over the infinite axis in accordance with a
Poisson process with parameter $c$. (Note that the parameter is $\xi$ here,
representing distance rather than time; the “events” are the $x_j$'s, the
particle positions. The events are now distributed over the whole axis
instead of the half-axis.)

We can drop this indirect approach, and suppose simply that for each
$t$ we have a sequence of random variables $\cdots, x_0(t), x_1(t), \cdots$, where
$x_j(t)$ is the position of the $j$th particle at time $t$. At $t = 0$ the particles
have the Poisson distribution over the $\xi$-axis, with $c$ particles per unit
length, and are numbered so that $\cdots < x_0(0) < x_1(0) < \cdots$. We wish
to find conditions on the $x_j(t)$ processes which insure that at all future
times the particles will still have the Poisson distribution over the $\xi$-axis
with the same density $c$. The $x_j(t)$ processes are not mutually independent,
because of the prescribed inequalities at $t = 0$. We suppose that the
distribution of $x_j(t) - x_j(0)$ does not depend on $j$, for $t > 0$. Moreover we
suppose that, for each $t > 0$, the classes of random variables

    $\{x_j(t) - x_j(0),\ -\infty < j < \infty\}, \qquad \{x_j(0),\ -\infty < j < \infty\}$,

are mutually independent, and that the random variables of the first class are
mutually independent.

Let $\mu_1(\omega), \cdots, \mu_n(\omega)$ be the number of $x_j(t,\omega)$'s, for some fixed
$t > 0$, in $n$ finite intervals $I_1, \cdots, I_n$ of lengths $l_1, \cdots, l_n$. The intervals
are supposed disjunct, except possibly for endpoints. We shall prove that
the particles are distributed in the same way at time $t$ as at time 0 by
proving

(5.1)    $P\{\mu_j(\omega) = m_j,\ j = 1, \cdots, n\} = \prod_{j=1}^n \frac{e^{-c l_j} (c l_j)^{m_j}}{m_j!}$.

Note that this does not state that any individual particle has a position 
whose distribution at time t is independent of the time. In fact, if the 
individual particles are for example subject to the Brownian movement 




process, then each one will tend to be far from $\xi = 0$ for large $t$. The
statement means, however, that the system itself is in macroscopic equilibrium,
considering only relative positions. If each particle is moving in
the direction of increasing $\xi$ with constant unit velocity, the conditions
imposed on the motion are satisfied, and (5.1) is obviously true. To prove
(5.1) we note first that, if $a < b$ and if $F$ is a distribution function, then

    $\int_{-\infty}^{\infty} [F(b - \xi) - F(a - \xi)]\, d\xi = \int_{-\infty}^{\infty} d\xi \int_{a - \xi}^{b - \xi} dF(\eta) = \int_{-\infty}^{\infty} dF(\eta) \int_{a - \eta}^{b - \eta} d\xi = b - a$.

It will be convenient to write $F(I)$ for $F(b) - F(a)$ if $I$ is the interval $(a, b]$,
and we understand by $I - \xi$ the interval $I$ translated through $-\xi$ units.
The preceding equation can be written in the form

(5.2)    $\int_{-\infty}^{\infty} F(I - \xi)\, d\xi = b - a$.


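This equation remains true if $I$ is an interval sum, interpreting $F(I - \xi)$
in the obvious way, if $b - a$ is replaced by the length of $I$. Let $F$ be the
distribution function of $x_j(t) - x_j(0)$. The probability that $m_j$ of the
particles at time 0 are in the interval $|\xi| \le \alpha/2$ and go into $I_j$ at time
$t$, $j = 1, \cdots, n$, is the probability (summed over $k \ge \sum_{j=1}^n m_j = m$) that $k$
of the particles are initially in the interval $|\xi| \le \alpha/2$ and that $m_j$ of these
are in $I_j$ at time $t$, $j = 1, \cdots, n$; this probability is

    $\sum_{k=m}^{\infty} \frac{e^{-c\alpha} (c\alpha)^k}{k!} \frac{k!}{m_1! \cdots m_n! (k - m)!} \prod_{j=1}^n \left[\int_{-\alpha/2}^{\alpha/2} F(I_j - \xi) \frac{d\xi}{\alpha}\right]^{m_j} \left[\int_{-\alpha/2}^{\alpha/2} [1 - F(I - \xi)] \frac{d\xi}{\alpha}\right]^{k - m}$,

where $I = \bigcup_{j=1}^n I_j$. This probability can be written in the form

    $e^{-c\alpha} \prod_{j=1}^n \frac{\left[c \int_{-\alpha/2}^{\alpha/2} F(I_j - \xi)\, d\xi\right]^{m_j}}{m_j!} \sum_{k=0}^{\infty} \frac{\left[c \int_{-\alpha/2}^{\alpha/2} [1 - F(I - \xi)]\, d\xi\right]^k}{k!}$
        $= \prod_{j=1}^n \frac{\left[c \int_{-\alpha/2}^{\alpha/2} F(I_j - \xi)\, d\xi\right]^{m_j}}{m_j!} \exp\left[-c \int_{-\alpha/2}^{\alpha/2} F(I - \xi)\, d\xi\right]$,

and, using (5.2), the last term converges to the right side of (5.1), as
$\alpha \to \infty$, as was to be proved.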
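
The identity (5.2), on which the proof rests, states that the integral over all $\xi$ of $F(b - \xi) - F(a - \xi)$ equals $b - a$ for any distribution function $F$. It can be confirmed numerically for a particular $F$. The sketch below takes $F$ to be the standard normal distribution function, an arbitrary choice made for illustration, and uses a midpoint rule on a truncated range; the truncation limits and step count are assumptions that are adequate for this $F$ and moderate $a$, $b$.

```python
import math


def normal_cdf(x: float) -> float:
    """Distribution function of a Gaussian variable with mean 0, variance 1."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def translated_mass_integral(a: float, b: float, lo: float = -40.0,
                             hi: float = 40.0, steps: int = 40000) -> float:
    """Midpoint-rule value of the integral of F(b - xi) - F(a - xi) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        xi = lo + (i + 0.5) * h
        total += normal_cdf(b - xi) - normal_cdf(a - xi)
    return total * h


if __name__ == "__main__":
    # Should be close to b - a = 3.5.
    print(translated_mass_integral(-1.0, 2.5))
```

The answer does not depend on the shape of $F$, which is why the displacement law of the particles plays no role in the limiting distribution.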




As a particular case, suppose that the particle velocities exist, that is,
almost all the sample functions of the $x_j(t)$ processes are absolutely continuous.
If $v_j(t,\omega) = dx_j(t,\omega)/dt$, the conditions we have imposed above
on the $x_j(t)$ processes will be satisfied if the following conditions are
satisfied. If $n$ is any positive integer, and if $0 < t_1 < \cdots < t_n$, the joint
distribution of $v_j(t_1), \cdots, v_j(t_n)$ does not depend on $j$. The classes of
random variables

    $\{v_j(t),\ 0 < t < \infty,\ -\infty < j < \infty\}, \qquad \{x_j(0),\ -\infty < j < \infty\}$,

are mutually independent, and the classes

    $\{v_j(t),\ 0 < t < \infty\}, \qquad -\infty < j < \infty$,

are mutually independent. Thus, under these hypotheses, a Poisson distribution
of particles at time 0 is reproduced at all later times. Note that
although the system is then in macroscopic equilibrium we have not
supposed that the $v_j(t)$ process is stationary, or even that the above
hypotheses are self-reproducing. For example, $x_j(s)$ may not be independent
of the $v_j(t)$'s for $t > s$ except when $s = 0$.

All the above considerations have been one-dimensional. In $r$ dimensions
a Poisson distribution of particles is one in which, if $I_1, \ldots, I_n$
are $r$-dimensional intervals, disjunct except perhaps for points on
$(r-1)$-dimensional faces, of $r$-dimensional volumes $l_1, \ldots, l_n$, and if $u_j$ is the
number of particles in $I_j$, then (5.1) is true. The translation of the
results of this section into $r$ dimensions is trivial.


6. The centering of the general process with independent increments 


If $\{x_t, t \in T\}$ is a process with independent increments, and if $f$ is any
function of $t \in T$, the process $\{x_t - f(t), t \in T\}$ also has independent
increments. Lévy has shown that $f$ can be chosen in such a way that
sample functions of the $x_t - f(t)$ process have simple continuity properties.
As a first step in obtaining his results, we prove the following theorem.

THEOREM 6.1 Let $\{x_t, t \in T\}$ be a process with independent increments.
Define $T'$ as the set of limit points of $T$, except that the minimum and
maximum values of the closure of $T$ are to be excluded from $T'$ unless they
are in $T$. There is then a function $f$, defined for $t \in T$, such that, if
$z_t = x_t - f(t)$, the process $\{z_t, t \in T\}$ is a process with independent
increments, with the following properties:

(i) To each point $t \in T'$ which is a limit point of $T$ from the left [right]
there corresponds a random variable $z_{t-}$ [$z_{t+}$] such that, if $s_n \to t$ with
$s_n < t$ [$s_n > t$] and $s_n \in T$, then

$$\lim_{n \to \infty} z_{s_n} = z_{t-} \qquad [\lim_{n \to \infty} z_{s_n} = z_{t+}]$$

with probability 1. If the $z_t$ process is separable, these sequential limits
can be replaced by ordinary limits

$$\lim_{s \uparrow t} z_s = z_{t-} \qquad [\lim_{s \downarrow t} z_s = z_{t+}] \qquad (s \in T)$$

with probability 1.

(ii) If any difference $z_t - z_s$, or any such difference with $t$ replaced by
$t+$ or $t-$, or $s$ replaced by $s+$ or $s-$, is identically constant with
probability 1, the constant is 0.

(iii) Except possibly for the points of an at most enumerable $t$ set $S \subset TT'$,
for each $t \in T$ the following equation holds with probability 1, between as
many of the members as are defined:

$$z_{t-} = z_t = z_{t+},$$

that is, at most enumerably many parameter points are fixed points of
discontinuity of the $z_t$ process. The set $S$ is independent of the choice of
the function $f$.

Any function $f$ with the properties described in the theorem will be
called a centering function of the $x_t$ process, and a process for which
$f \equiv 0$ is a centering function will be called a centered process. For
example, the $z_t$ process of the theorem is a centered process. The centering
function $f$ is not uniquely determined by the process, since, if $g$ is a
function defined and continuous on the closure of $T$, $f + g$ is also a
centering function. It will be sufficient to prove the theorem in the real
case, which we shall assume from now on, because in the complex case
the real and imaginary parts of the $x_t$'s can be treated separately. Our
choice of a centering function is motivated by the following fact. If
$\{w_n, n \ge 1\}$ is a sequence of real random variables, and if there is a sequence
$\{c_n\}$ of constants such that

$$\lim_{n \to \infty} (w_n - c_n) = w$$

exists and is finite with probability 1, then, if $c_n'$ is, for each $n$, the unique
constant satisfying

$$E\{\arctan (w_n - c_n')\} = 0,$$

it follows that $\lim_{n \to \infty} (w_n - c_n')$ exists and is finite with probability 1. In
other words, $\lim_{n \to \infty} (c_n' - c_n)$ exists and is finite. To prove this assertion,
let $c$ be a finite limiting value of the sequence $\{c_n' - c_n\}$. Then

$$\lim_{n \to \infty}{}' \,(w_n - c_n') = w - c$$

with probability 1, where the prime signifies that $n \to \infty$ along some
sequence of integers. Hence

$$0 = \lim_{n \to \infty}{}' \,E\{\arctan (w_n - c_n')\} = E\{\arctan (w - c)\}.$$

Thus there is at most one finite limiting value of the sequence $\{c_n' - c_n\}$
since $c$ is uniquely determined by this equation. Since an infinite limiting
value must satisfy this same equation, with the obvious conventions, there
can be no infinite limiting value. Hence $\lim_{n \to \infty} (c_n' - c_n)$ exists, as was to
be proved.
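The arctan centering constant can be made concrete for a random variable with finitely many values. The sketch below solves $E\{\arctan(w - c')\} = 0$ by bisection, using the fact that the left side is strictly decreasing in $c'$. The function name `arctan_center`, the discrete law, and the tolerance are illustrative assumptions, not anything prescribed by the text.

```python
import math

def arctan_center(values, probs, tol=1e-12):
    """Solve E{arctan(w - c)} = 0 for c, where w takes values[i] with probability probs[i].

    Hypothetical helper: g(c) = E{arctan(w - c)} is strictly decreasing in c,
    non-negative at c = min(values) and non-positive at c = max(values),
    so bisection on that interval finds the unique root.
    """
    def g(c):
        return sum(p * math.atan(v - c) for v, p in zip(values, probs))

    lo, hi = min(values), max(values)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid      # root lies to the right of mid
        else:
            hi = mid      # root lies at or to the left of mid
    return 0.5 * (lo + hi)

# A symmetric law is centered at its point of symmetry.
c_symmetric = arctan_center([-1.0, 1.0], [0.5, 0.5])

# Translating the law by a constant translates the centering constant by the same amount.
c_base = arctan_center([0.0, 1.0, 10.0], [0.5, 0.3, 0.2])
c_shifted = arctan_center([7.0, 8.0, 17.0], [0.5, 0.3, 0.2])
```

The translation property is the simplest instance of the fact just proved: if $w_n = w + n$ and $c_n = n$, then $c_n' - c_n$ is the same constant for every $n$.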
With this fact in mind, we define $f(t)$, for each $t \in T$, to satisfy

$$E\{\arctan [x_t - f(t)]\} = 0, \qquad t \in T,$$

and define

$$z_t = x_t - f(t).$$

Proof of (i) Suppose that $s_n \to t$, $s_n < t$, $s_n \in T$, and suppose that
$s_1 < s_2 < \cdots$. The series

$$\sum_{j=1}^{\infty} (x_{s_{j+1}} - x_{s_j})$$

is a series of mutually independent random variables, and the equation

$$\sum_{j=1}^{n} (x_{s_{j+1}} - x_{s_j}) + (x_t - x_{s_{n+1}}) = x_t - x_{s_1},$$

in which the parenthetical differences on the left are mutually independent,
and in which the right side does not involve $n$, implies that the series
converges with probability 1 when centered (III, Theorem 2.8), that is,

$$\lim_{n \to \infty} (x_{s_n} - c_n)$$

exists and is finite with probability 1, for some choice of centering constants
$c_1, c_2, \ldots$. Then, because of the way $z_{s_n}$ was defined, it follows that
$\lim_{n \to \infty} z_{s_n}$ exists and is finite with probability 1. The hypothesis that the


sequence $\{s_n\}$ is monotone can now be dropped, because, if the sequence
is not monotone in the first place, it can be reordered to be monotone.
Finally, if

$$s_n' \to t, \quad s_n' < t, \quad s_n' \in T,$$
$$s_n'' \to t, \quad s_n'' < t, \quad s_n'' \in T,$$

then

$$\lim_{n \to \infty} z_{s_n'} = \lim_{n \to \infty} z_{s_n''}$$

with probability 1, because the sequences $\{s_n'\}$, $\{s_n''\}$ can be combined into
a single sequence $\{s_n\}$ convergent to $t$ from below, yielding a convergent
sequence $\{z_{s_n}\}$. Thus $z_{t-}$ can be defined as the limit of $z_s$ when $s \uparrow t$
sequentially. The definition of $z_{t+}$ is made in the same way, in terms of
sequential approach from above. (Use the preceding results, replacing
$x_t$ by $x_t^* = x_{-t}$ to convert approach from above to approach from below.)
In particular, if the $z_t$ process is separable the sequential one-sided limits
can, as always in the separable case, be replaced by ordinary limits.
Proof of (ii) Since

$$E\{\arctan z_t\} = 0, \qquad t \in T,$$

integration to the limit yields

$$E\{\arctan z_{t-}\} = E\{\arctan z_{t+}\} = 0, \qquad t \in T'.$$

These equations imply that, if one of the differences described in (ii) is
constant with probability 1, then the constant must be 0.

Proof of (iii) The result of (i) implies that

$$\operatorname*{p\,lim}_{s \uparrow t} z_s = z_{t-}, \qquad \operatorname*{p\,lim}_{s \downarrow t} z_s = z_{t+}, \qquad t \in T',$$

whenever the right sides are defined. Then (iii) follows as a direct
application of VII, Theorem 11.1. The set $S$ of fixed points of
discontinuity is independent of the choice of centering function, because a
change of centering function can only change the differences $z_t - z_{t-}$ and
$z_{t+} - z_t$ by constants, and such a change cannot reduce one of these
differences to 0, by (ii). If $t \in T'T$ is a limit point of $T$ from both sides,
it is not a fixed point of discontinuity if and only if

$$z_{t+} - z_{t-} = (z_{t+} - z_t) + (z_t - z_{t-}) = 0$$

with probability 1, because the sum of two mutually independent random
variables is constant with probability 1 if and only if each summand is.
(This statement follows from the fact that the product of two characteristic
functions has modulus identically 1 if and only if the same is true of
each factor.)

In the following we shall adopt the convention that, if $\{x_t, t \in T\}$ is a
centered process with independent increments and if $t \in T$ is not a limit
point of $T$ from the left [right], then $x_{t-}$ [$x_{t+}$] is defined as $x_t$.

The proof of Theorem 6.1 has been completed, but it may be instructive
to note that the set $S$ of the theorem can be described without the use of
a centering function. In fact, we now prove that, if $a$ is a parameter
value, held fast below, then $S(a, \infty)$ is the set of discontinuities of the
monotone non-increasing $t$ function defined by

$$\int_0^1 |\Phi_t(u)|\,du, \qquad t > a.$$

To prove this, let $\Phi_t$ be the characteristic function of $x_t - x_a$,

$$\Phi_t(u) = E\{e^{iu(x_t - x_a)}\}.$$

Then, since

$$\Phi_{t+h}(u) = \Phi_t(u)\,E\{e^{iu(x_{t+h} - x_t)}\}, \qquad h > 0,$$

it follows that $|\Phi_t(u)|$ is monotone non-increasing in $t$ for fixed $u$. Let
$f$ be any centering function of the process, let $z_t = x_t - f(t)$, and let
$\Psi_t$ be the characteristic function of $z_{t+} - z_{t-}$. Then $|\Psi_t| \equiv 1$ when
$t \notin S$, but $|\Psi_t(u)| < 1$ at some points of the interval $[0, 1]$ if $t \in S$, since,
for such a value of $t$, $z_{t+} - z_{t-}$ is not with probability 1 a constant.
Moreover, the characteristic function of $z_t - z_a$ has absolute value $|\Phi_t|$, and
since the characteristic function of $z_t - z_a$ jumps by $\Psi_s$ at a point $s \in S$
but is continuous at other points of $T$, it follows that $|\Phi_t|$ jumps by $|\Psi_s|$
at a point $s \in S$, but is continuous at other points of $T$. Then $S$ is the set
of points of discontinuity of the $t$ function defined by

$$\int_0^1 |\Phi_t(u)|\,du, \qquad t > a,$$

as was to be proved.
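To see this characterization at work, consider a hypothetical process chosen only for illustration: $x_t = B_t + U\,1\{t \ge 1\}$ with $a = 0$, where $B_t$ is a Brownian motion with unit variance parameter and $U = \pm 1$ with probability 1/2 each, independent of $B$. This process has independent increments and a single fixed point of discontinuity at $t = 1$. Here $|\Phi_t(u)| = e^{-tu^2/2}$ for $t < 1$ and $e^{-tu^2/2}|\cos u|$ for $t \ge 1$. The sketch below, with assumed helper names and a midpoint-rule quadrature, evaluates the integral and exhibits the jump.

```python
import math

def abs_cf(t, u, jump_time=1.0):
    """|Phi_t(u)| for the illustrative process B_t + U * 1{t >= jump_time}, with a = 0."""
    brownian = math.exp(-0.5 * t * u * u)                  # modulus of the Brownian factor
    jump = abs(math.cos(u)) if t >= jump_time else 1.0     # c.f. of U = +-1 is cos(u)
    return brownian * jump

def h(t, jump_time=1.0, steps=2000):
    """Midpoint-rule approximation of the integral of |Phi_t(u)| over 0 <= u <= 1."""
    du = 1.0 / steps
    return du * sum(abs_cf(t, (k + 0.5) * du, jump_time) for k in range(steps))

# The function drops sharply across t = 1 and varies smoothly elsewhere.
drop_at_fixed_point = h(0.999) - h(1.0)
change_elsewhere = abs(h(0.5) - h(0.501))
```

In this example `drop_at_fixed_point` is of order 0.1 while `change_elsewhere` is of order $10^{-4}$, so the only discontinuity of the integral is at the fixed point of discontinuity.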

Although there is no reason to prefer one centering function to another
as far as Theorem 6.1 is concerned, we shall see that certain centering
functions have special advantages.

THEOREM 6.2 Let $\{x_t, t \in T\}$ be a process with independent increments,
and let $I$ be the closed interval with endpoints the minimum and maximum
values of the closure of $T$ except that the endpoints are themselves included
only if they are in $T$. Then it is possible to define $x_t$ for $t \in I - T$ in such
a way that the process $\{x_t, t \in I\}$ has independent increments, and that the
latter process is centered if the former is.

Let $f$ be a centering function of the process, and let $z_t = x_t - f(t)$. If
$t \in I - T$ and if $t$ is a limit point of $T$ from the right, define $x_t = z_{t+}$.
The remaining points of $I$ are in semi-open or open intervals $[c, d)$
or $(c, d)$ with $d$ either in $T$ or not in $T$ but a limit point of $T$ from the
right. Define $x_t = x_d$ in each such interval. The process $\{x_t, t \in I\}$ as
so defined has independent increments. If the $x_t$ process is centered,
we can take $f(t) \equiv 0$, and in that case the extended $x_t$ process will also
be centered.

THEOREM 6.3 Let $\{x_t, t \in T\}$ be a centered separable process with
independent increments. Then, if $c, d \in T$, almost all sample functions of the
process are bounded for $c \le t \le d$.

It is sufficient to prove this in the real case, and we shall accordingly
assume that the process is real from now on. Let $m(t)$ be a median of
$x_d - x_t$. If $s_n \uparrow t$ [$s_n \downarrow t$] with $c \le s_n \le d$, and $s_n \in T$, then every
limiting value of the sequence $\{m(s_n)\}$ must be a median of $x_d - x_{t-}$
[$x_d - x_{t+}$]. It follows that $m$ is a bounded function of $t$ for $c \le t \le d$;
say $|m(t)| \le K$. Define

$$z_t = x_t - x_c + m(t), \qquad c \le t \le d.$$

Then the $z_t$ process has independent increments, and 0 is a median of
$z_d - z_t$. If

$$c \le t_1 < \cdots < t_n \le d, \qquad t_j \in T,$$

it follows from III, Theorem 2.2, that

$$P\{\max_j z_{t_j}(\omega) \ge \lambda\} \le 2P\{z_d(\omega) \ge \lambda\}, \qquad \lambda > 0.$$

Hence

$$P\{\max_j\, [x_{t_j}(\omega) - x_c(\omega)] \ge \lambda + K\} \le 2P\{x_d(\omega) - x_c(\omega) \ge \lambda\}.$$

Since this inequality is true for all finite subsets $\{t_j\}$ of $T[c, d]$, it is also
true for enumerably infinite subsets, and therefore (separability of the $x_t$
process)

$$P\{\operatorname*{L.U.B.}_{t \in T[c, d]} [x_t(\omega) - x_c(\omega)] > \lambda + K\} \le 2P\{x_d(\omega) - x_c(\omega) \ge \lambda\}.$$

Applying this result to the $-x_t$ process and combining the two
inequalities, we obtain

$$P\{\operatorname*{L.U.B.}_{t \in T[c, d]} |x_t(\omega) - x_c(\omega)| > \lambda + K\} \le 2P\{|x_d(\omega) - x_c(\omega)| \ge \lambda\}. \tag{6.1}$$

This inequality implies that almost all sample functions of the $x_t$ process
are bounded in the interval $[c, d]$.

We now construct an example which will clarify the role of the fixed
points of discontinuity of a process with independent increments. Let
$t_1, t_2, \ldots$ be a finite or enumerably infinite linear set, and let $I$ be the
closed interval with endpoints the minimum and maximum values of the
closure of $\{t_j\}$ except that the endpoints are not included in $I$ unless they
themselves are $t_j$'s. It is supposed that to each $t_j$ corresponds a pair of
real random variables $u_j$, $v_j$, with the following properties:

DP$_1$ The random variables $u_1, v_1, u_2, v_2, \ldots$ are mutually independent.

DP$_2$ $P\{u_j(\omega)^2 + v_j(\omega)^2 > 0\} > 0$, $j \ge 1$, and, if a $u_j$ or a $v_j$ is identically
constant with probability 1, the constant is 0.

DP$_3$ For every closed finite interval $J \subset I$ the series

$$\sum_{t_j \in J} u_j, \qquad \sum_{t_j \in J} v_j$$

converge with probability 1, no matter how the summands are ordered
(that is, in the terminology of III §2, both these series, in some order of
summation, have absolute centering constants $0, 0, \ldots$).


We recall from III §2 that the sums in DP$_3$ will be independent of the
order of summation, neglecting values on $\omega$ sets of zero probability.

Let $a$ be any point of $I$, fixed hereafter, and define the random variable
$x_t$ for $t \in I$ by

$$x_t = \sum_{a < t_j \le t} u_j + \sum_{a \le t_j < t} v_j, \quad t \ge a; \qquad
x_t = -\sum_{t < t_j \le a} u_j - \sum_{t \le t_j < a} v_j, \quad t < a. \tag{6.2}$$

Then the process $\{x_t, t \in I\}$ has independent increments, and

$$x_t - x_s = \sum_{s < t_j \le t} u_j + \sum_{s \le t_j < t} v_j, \qquad s < t.$$

Note that, for each $t$, $x_t$ is defined with probability 1 according to this
definition, but $x_t$ may not be defined simultaneously for all $t$ with
probability 1, since we have not supposed that the series in DP$_3$ converge
absolutely with probability 1. However, we simply define each $x_t$
arbitrarily where it is not already defined. If $s_n \uparrow t \notin \{t_j\}$, then

$$x_t - x_{s_1} = \sum_{n=1}^{\infty} (x_{s_{n+1}} - x_{s_n})$$

and, if $s_n \downarrow t \notin \{t_j\}$,

$$x_t - x_{s_1} = \sum_{n=1}^{\infty} (x_{s_{n+1}} - x_{s_n})$$

with probability 1, by III, Theorem 2.7, Corollary 1. Hence in each case
$\lim_{n \to \infty} x_{s_n} = x_t$ with probability 1. Similarly

$$\lim_{n \to \infty} x_{s_n} = x_{t_j} - u_j, \quad s_n \uparrow t_j; \qquad
\lim_{n \to \infty} x_{s_n} = x_{t_j} + v_j, \quad s_n \downarrow t_j,$$

with probability 1. The $x_t$ process is centered, and

$$u_j = x_{t_j} - x_{t_j -}, \qquad v_j = x_{t_j +} - x_{t_j}.$$

The $t_j$'s are the fixed points of discontinuity of the process.
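For a finite set of points $t_j$ the construction can be checked directly on a single sample point. The sketch below evaluates the two sums for given numerical values of the $u_j$ and $v_j$ (the particular numbers, the base point, and the helper names are illustrative assumptions) and recovers the left and right jumps at each $t_j$ from values of $x_t$ just below and just above it.

```python
def x_of_t(t, a, points, u, v):
    """Evaluate the defining sums for one sample point.

    points[j] plays the role of t_j; u[j] is the jump on the left at t_j
    (counted once t >= t_j) and v[j] the jump on the right (counted once t > t_j).
    """
    if t >= a:
        left = sum(uj for tj, uj in zip(points, u) if a < tj <= t)
        right = sum(vj for tj, vj in zip(points, v) if a <= tj < t)
        return left + right
    left = sum(uj for tj, uj in zip(points, u) if t < tj <= a)
    right = sum(vj for tj, vj in zip(points, v) if t <= tj < a)
    return -(left + right)

def jumps_at(tj, a, points, u, v, eps=1e-9):
    """Approximate (x_{t_j} - x_{t_j -}, x_{t_j +} - x_{t_j}) using a small offset eps."""
    here = x_of_t(tj, a, points, u, v)
    below = x_of_t(tj - eps, a, points, u, v)
    above = x_of_t(tj + eps, a, points, u, v)
    return here - below, above - here

# Illustrative sample values; the base point a coincides with one of the t_j.
points = [0.5, 1.0, 2.0]
u = [0.3, -1.2, 0.7]
v = [0.1, 0.4, -0.5]
recovered = [jumps_at(tj, 1.0, points, u, v) for tj in points]
```

Each pair in `recovered` should reproduce the corresponding $(u_j, v_j)$, whether $t_j$ lies below, at, or above the base point $a$.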

We summarize these results, and add another, in the following theorem. 

THEOREM 6.4 If an $x_t$ process with independent increments is defined by
(6.2), under DP$_1$, DP$_2$, DP$_3$, then the fixed points of discontinuity are the
$t_j$'s with the discontinuities indicated in the preceding equations. Moreover,
if the process is defined in such a way that it is separable, then almost all
the sample functions are continuous except at the $t_j$'s.

Only the last statement of the theorem remains to be proved. It is
much stronger than the statement that only the $t_j$'s are fixed points of
discontinuity. The continuity properties of this process are at the other
extreme from those of a separable Poisson process in which (except in
degenerate cases) almost all the sample functions are discontinuous, even
though there are no fixed points of discontinuity. To prove the theorem
we shall reduce it to the case in which the $u_j$'s and $v_j$'s are uniformly
bounded and have zero means. It is sufficient to prove that almost all
sample functions of the $x_t$ process are continuous, except at the $t_j$'s, in
every closed interval $J \subset I$ whose endpoints are $t_j$'s. To avoid
complicating the notation we can, therefore, suppose in the first place that DP$_3$
is true with $J$ replaced by $I$, where $I$ is closed. We suppose then that the
series $\sum_1^\infty u_j$, $\sum_1^\infty v_j$ converge with probability 1, regardless of the order of
summation. We now apply the three-series theorem (III, Theorem 2.5).
Define

$$u_j'(\omega) = u_j(\omega) \ \text{ if } |u_j(\omega) - m_{u_j}| \le 1, \qquad u_j'(\omega) = m_{u_j} \ \text{ if } |u_j(\omega) - m_{u_j}| > 1,$$
$$v_j'(\omega) = v_j(\omega) \ \text{ if } |v_j(\omega) - m_{v_j}| \le 1, \qquad v_j'(\omega) = m_{v_j} \ \text{ if } |v_j(\omega) - m_{v_j}| > 1,$$

where $u_j$ and $v_j$ have median values $m_{u_j}$, $m_{v_j}$ respectively, and let $\sigma_j^2$ be
the variance of $u_j' + v_j'$. Then, according to the three-series theorem,
the series

$$\sum_1^\infty E\{u_j'\}, \quad \sum_1^\infty E\{v_j'\}, \quad \sum_1^\infty \sigma_j^2, \quad
\sum_1^\infty P\{u_j(\omega) \ne u_j'(\omega)\}, \quad \sum_1^\infty P\{v_j(\omega) \ne v_j'(\omega)\}$$

all converge, and even converge absolutely, since there is convergence for
every order of summation. The series $\sum_1^\infty u_j'$, $\sum_1^\infty v_j'$ are series of mutually
independent random variables, with absolute centering constants $0, 0, \ldots$
[because $u_j(\omega) = u_j'(\omega)$, $v_j(\omega) = v_j'(\omega)$ for sufficiently large $j$, with
probability 1]. Hence we can define an $x_t'$ process based on the $u_j'$'s
and $v_j'$'s just as the $x_t$ process was based on the $u_j$'s and $v_j$'s. By
Kolmogorov's generalization of the Chebyshev inequality, III, Theorem 2.1,
if $a$ is the initial point of $I$, and if $a = s_0 < \cdots < s_n$, then

$$P\{\max_{j \le n} |x_{s_j}'(\omega) - E\{x_{s_j}'\} - [x_a'(\omega) - E\{x_a'\}]| \ge \delta\} \le \frac{1}{\delta^2}\sum_{j=1}^{\infty} \sigma_j^2.$$

Now suppose that

$$\sum_{j=1}^{\infty} \left(|E\{u_j'\}| + |E\{v_j'\}|\right) \le \delta, \qquad
\sum_{j=1}^{\infty} \left[P\{u_j(\omega) \ne u_j'(\omega)\} + P\{v_j(\omega) \ne v_j'(\omega)\}\right] \le \delta. \tag{6.3}$$

Then

$$P\{\max_{j \le n} |x_{s_j}'(\omega) - x_a'(\omega)| \ge 2\delta\} \le \frac{1}{\delta^2}\sum_{j=1}^{\infty} \sigma_j^2$$

and

$$P\{x_{s_j}(\omega) = x_{s_j}'(\omega),\ j \le n\} \ge 1 - \delta.$$

Hence

$$P\{\max_{j \le n} |x_{s_j}(\omega) - x_a(\omega)| \ge 2\delta\} \le \frac{1}{\delta^2}\sum_{j=1}^{\infty} \sigma_j^2 + \delta.$$

Since the $x_t$ process is separable, this inequality implies the inequality

$$P\{\operatorname*{L.U.B.}_{t} |x_t(\omega) - x_a(\omega)| \ge 2\delta\} \le \frac{1}{\delta^2}\sum_{j=1}^{\infty} \sigma_j^2 + \delta. \tag{6.4}$$

Now suppose that a sample function of the $x_t$ process has a discontinuity,
not at a $t_j$, at which the oscillation is at least $4\delta$. For such a sample
function

$$\operatorname*{L.U.B.}_{t} |x_t(\omega) - x_a(\omega)| \ge 2\delta.$$

In other words, the $\omega$ set corresponding to such sample functions is
included in an $\omega$ set of probability at most the right side of (6.4), that is,
the outer measure $p_\delta$ of the former $\omega$ set satisfies the inequality

$$p_\delta \le \frac{1}{\delta^2}\sum_{j=1}^{\infty} \sigma_j^2 + \delta.$$

Moreover, if a finite number of the $t_j$'s, together with the corresponding
$u_j$'s and $v_j$'s, are deleted from this development, the sample function
discontinuities are not changed except at the deleted $t_j$'s. Hence

$$p_\delta \le \frac{1}{\delta^2}\sum_{j=k}^{\infty} \sigma_j^2 + \delta$$

for every $k$ for which (6.3) is true with the sums over $j \ge k$ (instead of
$j \ge 1$). The inequality for $p_\delta$ is therefore true for sufficiently large $k$, and
we find that

$$p_\delta \le \delta.$$

Since $p_\delta$ is monotone non-increasing in $\delta$, this inequality means that
$p_\delta = 0$ for every $\delta$, that is, almost all sample functions of the $x_t$ process
are continuous except at the $t_j$'s (where their discontinuities have already
been analyzed), as was to be proved.

We observe that by application (b) of VII §12 we could have deduced
without any calculation whatever the fact that almost all $x_t$ process sample
functions have left- and right-hand limits at all their discontinuities
(which implies that they have at most enumerably many discontinuities),
but some calculation is necessary to show that almost all the sample
functions are continuous except at the $t_j$'s.

Now consider any real process $\{x_t, t \in T\}$ which has independent
increments. It will be convenient to suppose that $T$ is an interval, and we
shall do so. This is no real restriction, according to Theorem 6.2. Let
$f_1$ be a centering function of the $x_t$ process, and let $\{t_j\}$ be the set of fixed
points of discontinuity of the centered process. Define $\hat u_j$, $\hat v_j$ as the
jumps of the centered process $z_t = x_t - f_1(t)$ on the left and right at $t_j$,

$$\hat u_j = z_{t_j} - z_{t_j -}, \qquad \hat v_j = z_{t_j +} - z_{t_j}.$$

It is clear that $\hat u_1, \hat u_2, \ldots, \hat v_1, \hat v_2, \ldots$ are mutually independent random
variables. We have seen in III §2 that, if a series of mutually independent
random variables converges with probability 1 when centered, there are
absolute centering constants (which can always be taken as truncated
expectations for example) for which the centered series will converge with
probability 1 regardless of the order of summation, and for which every
subseries has this same property. Let $\{u_j\}$, $\{v_j\}$ be the sequences $\{\hat u_j\}$, $\{\hat v_j\}$
when centered by subtraction of truncated expectations, as defined in
III §2. Then, if $J : [a, b]$ is a closed subinterval of $T$, if $\{t_{a_j}\}$ is the subset
of $\{t_j\}$ in $J$, and if $\Delta_n$ is defined by

$$\sum_{j=1}^{n} u_{a_j} + \Delta_n = x_b - x_a,$$

it follows that the random variables on the left are mutually independent.
But then, according to III, Theorem 2.8, the series $\sum_j u_{a_j}$ converges with
probability 1 when centered. The same argument is applicable to the
series $\sum_j v_{a_j}$. Since the centering has already been done, we have proved
that, if $J$ is any closed interval of parameter values, the series

$$\sum_{t_j \in J} u_j, \qquad \sum_{t_j \in J} v_j$$

converge with probability 1 regardless of the order of summation. Now
define $x_t^{(1)}$ as the right side of (6.2). We have seen (Theorem 6.4) that
the $x_t^{(1)}$ process has independent increments, is centered, and has the $t_j$'s
as its fixed points of discontinuity, with

$$x_{t_j}^{(1)} - x_{t_j -}^{(1)} = u_j, \qquad x_{t_j +}^{(1)} - x_{t_j}^{(1)} = v_j.$$

The process $\{x_t - f_1(t) - x_t^{(1)}, t \in T\}$ has independent increments. Let $f_2$
be a centering function of this process, and define $x_t^{(2)}$ by

$$x_t = f(t) + x_t^{(1)} + x_t^{(2)}, \qquad f = f_1 + f_2. \tag{6.5}$$

Then the $x_t^{(2)}$ process has independent increments, is centered, and has
no fixed points of discontinuity. Moreover, the two processes

$$\{x_t^{(1)}, t \in T\}, \qquad \{x_t^{(2)}, t \in T\}$$

are mutually independent. Note that $f$ is a centering function of the $x_t$
process. It is not an arbitrary centering function, but one chosen in
such a way that the discontinuities at the fixed points of discontinuity
have special properties. The $x_t^{(1)}$ process is the type described in
Theorem 6.4. The decomposition (6.5) is due to Lévy. In any particular
case all components need not be present. The decomposition (6.5) is
obviously also applicable to complex processes with independent increments.


7. The character of the distribution functions and the continuity of the
sample functions
Let $\{x_t, a \le t \le b\}$ ($a$, $b$ finite) be a process with independent
increments; suppose that the process is centered and that there are no fixed
points of discontinuity. Let $\Phi_{s,t}$ be the characteristic function of $x_t - x_s$
($s < t$),

$$\Phi_{s,t}(u) = E\{e^{iu(x_t - x_s)}\}, \qquad s < t.$$

Then, if $s_1 < s_2 < s_3$,

$$\Phi_{s_1, s_3} = \Phi_{s_1, s_2}\,\Phi_{s_2, s_3} \tag{7.1}$$

because the process has independent increments. Moreover,

$$\lim_{t \to s} \Phi_{s,t}(u) = 1$$

uniformly in $s$, $t$, $u$ for $u$ in any finite interval, because the process is
centered and there are no fixed points of discontinuity. This means, by
(7.1), that $\Phi_{s,t}(u)$ is continuous in $s$, $t$, $u$ with value 1 when $s = t$. Then,
if we write $\Phi_{s,t}$ in the form

$$\Phi_{s,t} = \prod_{j=0}^{n-1} \Phi_{s_j, s_{j+1}}, \qquad s_j = s + j(t - s)/n,$$

the characteristic function $\Phi_{s,t}$ is expressed as the product of characteristic
functions which can be made uniformly close to 1 in every finite $u$ interval.
It follows that the distribution of $x_t - x_s$ is infinitely divisible (III §4),
so that (III, Theorem 4.1) for each $t > a$,

$$\log \Phi_{a,t}(u) = iu\gamma_t + \int_{-\infty}^{\infty} \left(e^{iu\lambda} - 1 - \frac{iu\lambda}{1 + \lambda^2}\right)\frac{1 + \lambda^2}{\lambda^2}\,d_\lambda G(t, \lambda), \tag{7.2}$$

where $G(t, \cdot)$ is monotone non-decreasing, continuous on the right and
bounded in $\lambda$, with $G(t, -\infty) = 0$, and $\gamma_t$ is a constant. The left side of
this equation is continuous in $t$. It follows that $\gamma_t$ is continuous, and

$$\lim_{t_n \to t} G(t_n, \lambda) = G(t, \lambda)$$

at all points of continuity (in $\lambda$) of $G(t, \cdot)$. [Cf. the discussion in III §4 of
the determination of $\gamma$ and $G$ by the corresponding distribution, and note
for use below that according to the reasoning of that section the function
$G$ is uniquely determined by the left side of (7.2) even if $G$ is merely
supposed of bounded variation in $\lambda$ rather than monotone.] The above
reasoning applied to the interval $[s, t]$ gives the same formula for $\Phi_{s,t}$
with $\gamma_{s,t}$, $G(s, t, \lambda)$ in place of $\gamma_t$ and $G(t, \lambda)$. On the other hand, according
to (7.1),

$$\log \Phi_{s,t}(u) = iu(\gamma_t - \gamma_s) + \int_{-\infty}^{\infty} \left(e^{iu\lambda} - 1 - \frac{iu\lambda}{1 + \lambda^2}\right)\frac{1 + \lambda^2}{\lambda^2}\,d_\lambda[G(t, \lambda) - G(s, \lambda)]. \tag{7.3}$$

We deduce that

$$\gamma_{s,t} = \gamma_t - \gamma_s, \qquad G(s, t, \lambda) = G(t, \lambda) - G(s, \lambda),$$

so that (7.3) exhibits $\log \Phi_{s,t}$ in the Lévy-Khintchine expression for an
infinitely divisible distribution. Then, if $s < t$, $G(t, \cdot) - G(s, \cdot)$ must be
non-negative and monotone non-decreasing in $\lambda$. In other words, $G$ is
monotone non-decreasing in both $\lambda$ and $t$, $d_\lambda G$ and $d_t G$ are monotone in
$t$ and $\lambda$, and $G$ can be used to define a measure $\iint d_t d_\lambda G$ in $t$, $\lambda$ space.
This measure will prove important.

Conversely, suppose that for all $s$, $t$, with $0 < s < t$, $\Phi_{s,t}(u)$ is
determined by (7.3), where $\gamma_t$ and $G$ have the properties described. There is
then a centered process with independent increments and no fixed points
of discontinuity, obtained by assigning to $x_t - x_s$ the distribution
determined by (7.3).




If the distribution of $x_t - x_s$ depends only on $t - s$, and if the process
has independent increments, it is said to have stationary independent
increments. If such a process is centered, it can have no fixed points of
discontinuity, because the process has the same stochastic properties at
each value of the parameter. In this case (7.3) combined with (7.1) yields,
if $a = 0$,

$$\gamma_{s+t} = \gamma_s + \gamma_t, \qquad G(s + t, \lambda) = G(s, \lambda) + G(t, \lambda),$$

so that

$$\gamma_t = t\gamma, \qquad G(t, \lambda) = tG(\lambda)$$

for some constant $\gamma$ and function $G(\cdot)$. Thus (7.3) becomes

$$\log \Phi_{s,t}(u) = i(t - s)\gamma u + (t - s)\int_{-\infty}^{\infty} \left(e^{iu\lambda} - 1 - \frac{iu\lambda}{1 + \lambda^2}\right)\frac{1 + \lambda^2}{\lambda^2}\,dG(\lambda). \tag{7.3'}$$

Example 1 If the process is the Brownian movement process of §2,
$\gamma = 0$ and $G(\cdot)$ is constant except for a jump of magnitude $\sigma^2$ at $\lambda = 0$.

Example 2 If the process is the Poisson process of §4, with average
occurrence rate $c$, then $\gamma = c/2$ and $G(\cdot)$ is constant except for a jump of
magnitude $c/2$ at $\lambda = 1$.

Example 3 Consider the following process: Events are to occur in
accordance with a Poisson distribution, at average rate $c$. The random
variable $x_t$ is defined as the sum of $N$ independent random variables, each
with a given distribution function $F$, where $N$ is the number of events that
have occurred between times 0 and $t$ inclusive. In other words, at each
event a drawing is made from the $F$ distribution and $x_t$ is the cumulative
sum of the numbers drawn. The $x_t$ process has stationary independent
increments; it is centered and has no fixed points of discontinuity. The
characteristic function $\Phi_{0,t}$ is easily evaluated:

$$\Phi_{0,t}(u) = \sum_{n=0}^{\infty} e^{-ct}\,\frac{(ct)^n}{n!}\left[\int_{-\infty}^{\infty} e^{iu\lambda}\,dF(\lambda)\right]^n,$$

so that

$$\log \Phi_{0,t}(u) = ct\int_{-\infty}^{\infty} (e^{iu\lambda} - 1)\,dF(\lambda).$$

Making the proper identifications in (7.3'), we find that

$$\gamma = c\int_{-\infty}^{\infty} \frac{\lambda}{1 + \lambda^2}\,dF(\lambda), \qquad G(\lambda) = c\int_{-\infty}^{\lambda} \frac{\mu^2}{1 + \mu^2}\,dF(\mu).$$
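The identification in Example 3 can be checked numerically when $F$ is concentrated on finitely many nonzero points, in which case the integrals reduce to sums. The sketch below (per unit time, that is with $t - s = 1$) builds $\gamma$ and the jumps of $G$ from $c$ and $F$, evaluates the right side of (7.3'), and compares it with $c\int (e^{iu\lambda} - 1)\,dF(\lambda)$. The function names and the particular atoms are illustrative assumptions.

```python
import cmath

def kernel(u, lam):
    """Integrand of (7.3') at a point lam != 0."""
    return (cmath.exp(1j * u * lam) - 1 - 1j * u * lam / (1 + lam**2)) * (1 + lam**2) / lam**2

def compound_poisson_triplet(c, atoms):
    """atoms maps each jump size lam != 0 to its F-probability; returns gamma and the jumps of G."""
    gamma = c * sum(p * lam / (1 + lam**2) for lam, p in atoms.items())
    g_jumps = {lam: c * p * lam**2 / (1 + lam**2) for lam, p in atoms.items()}
    return gamma, g_jumps

def log_cf_from_triplet(u, gamma, g_jumps):
    """Right side of (7.3') per unit time for a purely atomic G."""
    return 1j * gamma * u + sum(g * kernel(u, lam) for lam, g in g_jumps.items())

def log_cf_direct(u, c, atoms):
    """c times the integral of (e^{iu lam} - 1) dF(lam), per unit time."""
    return c * sum(p * (cmath.exp(1j * u * lam) - 1) for lam, p in atoms.items())

atoms = {-2.0: 0.25, 0.5: 0.25, 1.0: 0.5}   # an illustrative choice of F
gamma, g_jumps = compound_poisson_triplet(3.0, atoms)

# Taking F to be a unit mass at 1 recovers the Poisson process of Example 2.
poisson_gamma, poisson_jumps = compound_poisson_triplet(4.0, {1.0: 1.0})
```

With $F$ a unit mass at $\lambda = 1$ and $c = 4$, the sketch returns $\gamma = 2 = c/2$ and a jump of $G$ of magnitude $2 = c/2$ at $\lambda = 1$.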




Example 4 Gaussian case Suppose that $\{x_t, a \le t \le b\}$ is a process
with independent increments and that $x_b - x_a$ has a Gaussian distribution.
According to a theorem of Cramér, if the sum of two mutually independent
random variables is Gaussian, each of the random variables is itself
Gaussian, and it follows that every difference $x_t - x_s$ is Gaussian. In
the case which interests us, when the process is centered and has no fixed
points of discontinuity, it is unnecessary to use the Cramér theorem to
obtain this result. In fact in this case, using the notation of (7.2), $G(b, \cdot)$
is constant except for a jump at $\lambda = 0$. Since $d_\lambda G(t, \lambda)$ is monotone
non-decreasing in $t$, it follows that $G(t, \cdot)$ must also be constant except for a
possible jump when $\lambda = 0$. Then the same is true of $G(t, \cdot) - G(s, \cdot)$, so
that $x_t - x_s$ is Gaussian. Now define

$$\sigma(t)^2 = E\{(x_t - x_a)^2\}.$$

If $\sigma(\cdot)^2$ has the form const. $(t - a)$, the $x_t$ process is the Brownian motion
process with parameter interval $[a, b]$. If not, $\sigma(\cdot)^2$ is still continuous,
and, if $y_\tau = x_s$, $\tau = \sigma(s)^2$, the $y_\tau$ process is the Brownian motion process
in the interval $[0, \sigma(b)^2]$. We conclude that the sample functions of the
$x_t$ process are (almost all) continuous functions in $[a, b]$, if the process is
separable.

The following theorem strengthens the results obtained in discussing 
this example, and adds a converse. 

THEOREM 7.1 Let $\{x_t, a \le t \le b\}$ be a centered process with independent
increments and no fixed points of discontinuity.

(i) The distribution of every difference $x_t - x_s$ is infinitely divisible.

(ii) The following conditions are equivalent:

(a) $x_b - x_a$ is Gaussian.
(b) Every difference $x_t - x_s$ is Gaussian.
(c) If $R$ is a denumerable subset of $[a, b]$, dense in $[a, b]$, almost all
sample functions coincide on $R$ with functions defined and continuous
on $[a, b]$ (that is, if the process is separable, almost all sample
functions are continuous on $[a, b]$).
(d) If $a = s_0 < \cdots < s_n = b$, and if $\delta = \max_j (s_{j+1} - s_j)$, then

$$\lim_{\delta \to 0} \sum_{j=0}^{n-1} P\{|x_{s_{j+1}}(\omega) - x_{s_j}(\omega)| \ge \varepsilon\} = 0. \tag{7.4}$$


We have already proved (i), the equivalence of (ii) (a) and (ii) (b), and 
the fact that (ii) (b) implies (ii) (c). To finish the proof we prove that 
(ii) (c) implies (ii) (d) and that (ii) (d) implies (ii) (a). 


If (ii) (c) is true, and if the process is separable, almost all sample
functions are uniformly continuous on $[a, b]$, and this implies that

$$\lim_{\delta \to 0} P\{\max_j |x_{s_{j+1}}(\omega) - x_{s_j}(\omega)| \ge \varepsilon\} = 0 \tag{7.4'}$$

for all $\varepsilon > 0$. We have already remarked (III §1) and used several times
the fact that (7.4') and (7.4) are equivalent, that is, the probability of at
least one of a number of mutually independent events and their expected
number go to 0 together. Hence (ii) (d) is true. If the process is not
separable, we change each $x_t$ on an $\omega$ set of probability 0 to make the
process separable. This change does not affect the validity of (7.4) or
(7.4'). Finally, if (ii) (d) is true, (ii) (a) is also true, by one form of the
central limit theorem (see the Corollary to III, Theorem 4.1).
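Condition (7.4) separates the Gaussian case from processes with jumps, and the contrast is easy to compute for equal subdivisions of an interval. The following sketch evaluates the sum in (7.4) in closed form for a Brownian motion process and for a Poisson process; the function names, the unit interval length, and the restriction to $0 < \varepsilon \le 1$ in the Poisson case are assumptions made for this illustration.

```python
import math

def brownian_sum(n, eps, sigma2=1.0, length=1.0):
    """Sum over n equal subintervals of P{|increment| >= eps} for Brownian motion.

    Each increment is Gaussian with mean 0 and variance sigma2 * length / n,
    so P{|increment| >= eps} = erfc(eps / (sd * sqrt(2))).
    """
    sd = math.sqrt(sigma2 * length / n)
    return n * math.erfc(eps / (sd * math.sqrt(2.0)))

def poisson_sum(n, eps, c=1.0, length=1.0):
    """Same sum for a Poisson process with rate c, assuming 0 < eps <= 1.

    An increment is >= eps exactly when at least one jump falls in the subinterval.
    """
    return n * -math.expm1(-c * length / n)

# As the subdivision is refined the Brownian sum tends to 0,
# while the Poisson sum tends to c * length, the expected number of jumps.
fine_brownian = brownian_sum(10_000, 0.1)
fine_poisson = poisson_sum(10_000, 0.1, c=2.0)
```

The Poisson process is centered and has no fixed points of discontinuity, yet its sum stays near $c$ times the interval length, in agreement with the fact that its increments are not Gaussian.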

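It is convenient, in analyzing the structure of a process with independent
increments, to write (7.3) in Lévy's original form,

$$\log \Phi_{s,t}(u) = i(\gamma_t - \gamma_s)u - \frac{\sigma_t^2 - \sigma_s^2}{2}\,u^2 + \left[\int_{-\infty}^{0-} + \int_{0+}^{\infty}\right]\left(e^{iu\lambda} - 1 - \frac{iu\lambda}{1 + \lambda^2}\right) d_\lambda[F(t, \lambda) - F(s, \lambda)], \tag{7.5}$$

where

$$F(t, \lambda) = \int_{-\infty}^{\lambda} \frac{1 + \mu^2}{\mu^2}\,d_\mu G(t, \mu), \quad \lambda < 0; \qquad
F(t, \lambda) = -\int_{\lambda}^{\infty} \frac{1 + \mu^2}{\mu^2}\,d_\mu G(t, \mu), \quad \lambda > 0;$$

$$\sigma_t^2 = G(t, 0+) - G(t, 0-).$$

Then $F(t, \cdot)$ is monotone non-decreasing in $\lambda$ for $\lambda > 0$ and for $\lambda < 0$, and
$\iint d_t d_\lambda F(t, \lambda)$ defines a $t$, $\lambda$ measure whose significance will be discussed
below. The function $F(t, \cdot)$ can be supposed continuous in $\lambda$ on the right
for $\lambda \ne 0$; it vanishes at $\pm\infty$. Although $F$ may not be bounded near
$\lambda = 0$,

$$\left[\int_{-1}^{0-} + \int_{0+}^{1}\right] \lambda^2\,d_\lambda F(t, \lambda) < \infty.$$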


If the $x_t$ process is a Poisson process, with average occurrence rate $c$,
then $\gamma_t = ct/2$, $\sigma_t^2 = 0$ and $F(t, \cdot)$ jumps $ct$ units at $\lambda = 1$ but otherwise
does not contribute to (7.5). More generally, define $x_t$ by

$$x_t = \sum_{j=1}^{n} \lambda_j x_t^{(j)} + \sigma y_t + \beta t,$$

where the $x_t^{(1)}, \ldots, x_t^{(n)}$, $y_t$ processes are mutually independent; the
$x_t^{(j)}$ process is a Poisson process with average occurrence rate $c_j$; the
$y_t$ process is a Brownian motion process with variance parameter 1;
$\lambda_1, \ldots, \lambda_n$, $c_1, \ldots, c_n$, $\sigma$, $\beta$ are constants; $\sigma \ge 0$; the $\lambda_j$'s are distinct
and none vanish. Then the $x_t$ process is a centered process with stationary
independent increments, with

$$\gamma_t = t\left(\beta + \sum_{j=1}^{n} \frac{c_j \lambda_j}{1 + \lambda_j^2}\right), \qquad \sigma_t^2 = t\sigma^2,$$

and $F(t, \lambda) = tF(\lambda)$, where $F(\cdot)$ increases for $\lambda \ne 0$ only in jumps of
magnitude $c_j$ at $\lambda_j$. In this way, then, the most general centered process
with stationary independent increments can be approximated by the sum
of a Brownian motion process, a linear combination of Poisson processes
and a linear function, approximated in such a way that the first two
terms on the right in (7.5) are exactly reproduced and the integral replaced
by a Riemann-Stieltjes sum. In the approximating process the integral

$$\int_s^t \int_\lambda^\mu d_\tau d_\nu F(\tau, \nu)$$

is simply the expected number of jumps (occurrences in the component
Poisson processes) of the sample functions, of magnitude between $\lambda$ and $\mu$
between times $s$ and $t$. This argument is easily extended to the
non-stationary case, and makes it plausible that in the most general case
almost all the sample functions of a centered separable process with
independent increments and no fixed points of discontinuity are continuous
except for jumps, and the above double integral is the expected number
of jumps of the sample functions, of magnitude between $\lambda$ and $\mu$ between
times $s$ and $t$. These results of Lévy's will now be proved.
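For the approximating process, a weighted sum of independent Poisson processes plus a Brownian term and a linear term, the agreement between the characteristic function computed directly and the one read off from (7.5) can be verified numerically. In the sketch below the function names and the particular constants are assumptions chosen for illustration; `expected_jumps` evaluates the double integral for this process as $(t - s)$ times the total rate of the jump sizes falling in the half-open interval $(\lambda, \mu]$, a convention assumed here.

```python
import cmath

def direct_log_cf(u, t, lams, rates, sigma, beta):
    """log E{exp(iu x_t)} computed term by term from the independent components."""
    poisson_part = sum(c * (cmath.exp(1j * u * lam) - 1) for lam, c in zip(lams, rates))
    return t * (poisson_part - 0.5 * sigma**2 * u**2 + 1j * beta * u)

def levy_form_log_cf(u, t, lams, rates, sigma, beta):
    """Right side of (7.5) with s = 0, using gamma_t, sigma_t^2 and the jumps of F."""
    gamma = beta + sum(c * lam / (1 + lam**2) for lam, c in zip(lams, rates))
    integral = sum(c * (cmath.exp(1j * u * lam) - 1 - 1j * u * lam / (1 + lam**2))
                   for lam, c in zip(lams, rates))
    return t * (1j * gamma * u - 0.5 * sigma**2 * u**2 + integral)

def expected_jumps(s, t, lo, hi, lams, rates):
    """Expected number of jumps with size in (lo, hi] between times s and t."""
    return (t - s) * sum(c for lam, c in zip(lams, rates) if lo < lam <= hi)

params = dict(lams=[-1.5, 0.5, 2.0], rates=[0.4, 1.0, 0.3], sigma=0.8, beta=-0.2)
gap = max(abs(direct_log_cf(u, 1.7, **params) - levy_form_log_cf(u, 1.7, **params))
          for u in (-2.0, 0.3, 4.0))
small_positive_jumps = expected_jumps(0.0, 2.0, 0.0, 1.0, params["lams"], params["rates"])
```

Here `gap` is zero up to rounding, and `small_positive_jumps` equals 2.0, since only the component with $\lambda_j = 0.5$ and rate 1.0 has jumps of size in $(0, 1]$.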

THEOREM 7.2 Except possibly for a set of sample functions of
probability 0 the sample functions of a separable centered process with
independent increments $\{x_t, t \in T\}$ have the following properties:

(i) They are bounded on every parameter set of the form $[c, d]T$, with
$c, d \in T$.

(ii) They have finite left- [right-] hand limits at every $t \in T$ which is a
limit point of $T$ from the left [right].

(iii) Their discontinuities are jumps, except perhaps at the fixed points
of discontinuity.

This theorem should be compared with the corresponding martingale 
theorem, VII, Theorem 11.5. Part (i) is simply a restatement of Theorem 
6.3, repeated here only for the sake of completeness. Part (ii) was 
proved in VII §12 as an application of martingale theory [application
(b)]. The fact that the parameter set of that application is an interval is
irrelevant, by Theorem 6.2. Part (iii) follows from (ii) and the definition 
of separability (see the reasoning used in the proof of VII, Theorem 11.5). 
A function which is continuous except for jumps is necessarily continuous 
except perhaps at the points of a finite or enumerable set. Hence almost 
all sample functions under consideration have this property. 

As three examples we mention the separable Brownian motion process 
whose sample functions are (almost all) continuous; the separable Poisson 
process whose sample functions are (almost all) monotone, increasing in 
unit jumps from the left-hand limit at a discontinuity to the right-hand 
limit, but whose points of discontinuity vary from function to function; 
and the processes considered in Theorem 6.4, whose sample functions are 
almost all continuous except at the fixed points of discontinuity. 

We close this section by a discussion of the distribution of the jumps
of the sample functions of a centered separable process $\{x_t, a < t < b\}$
with no fixed points of discontinuity. Let $\nu_{t,\lambda}$ be the number of jumps
of magnitude (that is, right-hand limit minus left-hand limit) $< \lambda$ $(< 0)$
between $a$ and $t$. Then $\nu_{t,\lambda}$ is a finite-valued integral-valued random
variable. The $\nu_{t,\lambda}$ process (fixed $\lambda$) obviously has independent increments
and it is centered; it has stationary increments if the $x_t$ process has. In
the latter case the $\nu_{t,\lambda}$ process is a Poisson process, since it satisfies the
qualitative defining characteristics (a), (b), (c) detailed in §4. If the $x_t$
process does not have stationary increments, the $\nu_{t,\lambda}$ process is a Poisson
process except for a change of the time variable. We wish to prove


(7.6)   $E\{\nu_{t,\lambda}\} = \int_a^t \int_{-\infty}^{\lambda} d_s\, d_x\, F(s, x) = F(t, \lambda),$

where $F(\cdot, \cdot)$ is the function in (7.5) supposed continuous on the right,
and thus explain the significance of the $(t, \lambda)$ measure defined by $F(\cdot, \cdot)$.
[The corresponding identification of $-F(\cdot, \cdot)$ for $\lambda > 0$ with the expected
number of jumps of magnitude $> \lambda$ between 0 and $t$ is treated in the
same way or reduced to the previous case by replacing $x_t$ by $-x_t$.] To
make the desired identification we need only go back to the derivation of
(7.5) in III §4, which can be outlined as follows: Write $x_t - x_s$ in the
form


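$$x_t - x_s = \sum_{j=0}^{n-1} \bigl(x_{t_{j+1}} - x_{t_j}\bigr) = \sum_{j=0}^{n-1} m_j^{(n)} + \sum_{j=0}^{n-1} \bigl(x_{t_{j+1}} - x_{t_j} - m_j^{(n)}\bigr),$$

where $m_j^{(n)}$ is a centering constant, obtained as described in III §4; it is
the expectation of $x_{t_{j+1}} - x_{t_j}$ after truncation of the latter around a median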




value. Since the $x_t$ process has no fixed points of discontinuity, the
$m_j^{(n)}$'s are uniformly small for $n$ large. The form (7.5) is then obtained,
by characteristic functions, with

(7.7)   $F(t, \lambda) - F(s, \lambda) = \lim_{n \to \infty} \sum_{j=0}^{n-1} P\{x_{t_{j+1}}(\omega) - x_{t_j}(\omega) - m_j^{(n)} < \lambda\}, \qquad \lambda < 0,$

at least at all the $\lambda$ continuity points of this limit. Now, fixing $t$, let $\Lambda$
be the (at most enumerable) set of values of $\lambda$ at which $F(t, \cdot)$ has a
discontinuity in $\lambda$ or for which the probability is positive that there is a
sample function jump of magnitude $\lambda$ between 0 and $t$. Let $\nu_{t,\lambda}^{(n)}$ be the
number of values of $j$ for which a sample function satisfies the condition
between the braces in (7.7). Then


(7.8)   $\lim_{n \to \infty} \nu_{t,\lambda}^{(n)} = \nu_{t,\lambda}, \qquad \lambda \notin \Lambda,$

for almost all sample functions, and (7.7) becomes

(7.7′)   $F(t, \lambda) = \lim_{n \to \infty} E\{\nu_{t,\lambda}^{(n)}\}, \qquad \lambda \notin \Lambda.$

Moreover, since $\nu_{t,\lambda}^{(n)}$ is the sum of $n$ mutually independent random
variables, each of which takes on only the values 1 or 0, its variance is
the sum of the variances of the summed variables, which is at most the
sum of their second moments, which is in turn the sum of their first
moments,

$$E\{(\nu_{t,\lambda}^{(n)})^2\} - E\{\nu_{t,\lambda}^{(n)}\}^2 \le E\{\nu_{t,\lambda}^{(n)}\} \to F(t, \lambda).$$

Then $E\{(\nu_{t,\lambda}^{(n)})^2\}$ remains bounded when $n \to \infty$. Hence integration to
the limit in (7.8) is legitimate, and implies, using (7.7′), the desired relation
(7.6) for $\lambda \notin \Lambda$. Then (7.6) is true for all $\lambda < 0$ by continuity, as was to
be proved.
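
The identification of $F(t, \lambda)$ with an expected jump count can be checked numerically in the simplest stationary case, a finite sum of independent Poisson processes. The following is only a sketch under stated assumptions: the jump sizes, rates, and function names are illustrative choices that do not come from the text, and $F(\lambda)$ is taken to be the total rate of the components whose jump size is below $\lambda$ (so that $F(-\infty) = 0$).

```python
import random

def count_jumps_below(lam, t, jump_sizes, rates, rng):
    """Number of jumps of magnitude < lam on (0, t] for a sum of independent
    Poisson processes; component j has jump size jump_sizes[j], rate rates[j]."""
    count = 0
    for size, rate in zip(jump_sizes, rates):
        if size >= lam:
            continue  # jumps of this component are not of magnitude < lam
        # Count the arrivals of a Poisson process of the given rate on (0, t]
        # by accumulating exponential waiting times.
        clock = rng.expovariate(rate)
        while clock <= t:
            count += 1
            clock += rng.expovariate(rate)
    return count

def expected_jumps_below(lam, t, jump_sizes, rates):
    """t * F(lam), with F(lam) the total rate of components with size < lam."""
    return t * sum(rate for size, rate in zip(jump_sizes, rates) if size < lam)

def mean_jump_count(lam, t, jump_sizes, rates, n_paths=20000, seed=0):
    """Monte Carlo estimate of the expected number of jumps below lam."""
    rng = random.Random(seed)
    total = sum(count_jumps_below(lam, t, jump_sizes, rates, rng)
                for _ in range(n_paths))
    return total / n_paths
```

With the hypothetical values `jump_sizes = [-2.0, -0.5, 1.0]`, `rates = [0.7, 1.5, 2.0]`, `lam = -1.0`, `t = 3.0`, only the first component contributes, so the Monte Carlo mean should lie close to $3 \times 0.7 = 2.1$.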


CHAPTER IX 


Processes with 
Orthogonal Increments 


1. Continuity properties 

Processes with orthogonal increments were defined in II §10. It was
proved there that to each such process $\{y_t, t \in T\}$ corresponds a monotone
non-decreasing function $F$, uniquely determined up to an additive constant,
satisfying

(1.1)   $E\{|y_t - y_s|^2\} = F(t) - F(s), \qquad s < t,$

which we shall write in the symbolic form

$$E\{|dy_t|^2\} = dF(t).$$


The continuity properties of F determine those of the y, process, in the 
following way. 

THEOREM 1.1 Let $\{y_t, t \in T\}$ be a process with orthogonal increments.
Define $T'$ as the set of limit points of $T$ except that the minimum and
maximum values of the closure of $T$ are to be excluded from $T'$ unless they
are in $T$.

(i) To each point $t \in T'$ which is a limit point of $T$ from the left [right]
there corresponds a random variable $y_{t-}$ [$y_{t+}$] such that

$$\operatorname*{l.i.m.}_{s \uparrow t} y_s = y_{t-} \qquad \bigl[\operatorname*{l.i.m.}_{s \downarrow t} y_s = y_{t+}\bigr].$$

(ii) Except possibly at the points of an at most enumerable $t$ set, for
each $t$ the following equation holds with probability 1, between as many of
the members as are defined:

$$y_{t-} = y_t = y_{t+}.$$

This theorem is almost obvious from (1.1). To prove that there is a
$y_{t-}$ for example, we need only remark that, if $t \in T'$ and if $t$ is a limit
point of $T$ from the left, then $F(s)$ is bounded for $s < t$ and

$$\lim_{s_1, s_2 \uparrow t} E\{|y_{s_2} - y_{s_1}|^2\} = \lim_{s_1, s_2 \uparrow t} [F(s_2) - F(s_1)] = 0,$$

so that $\operatorname*{l.i.m.}_{s \uparrow t} y_s = y_{t-}$ exists. The exceptional $t$ set of (ii) is the set of
jump points of the monotone function $F$, and obviously

$$F(t) - F(t-) = E\{|y_t - y_{t-}|^2\},$$
$$F(t+) - F(t) = E\{|y_{t+} - y_t|^2\},$$
$$F(t+) - F(t-) = E\{|y_{t+} - y_{t-}|^2\}.$$
Let $I$ be the closed interval whose endpoints are the minimum and
maximum values of the closure of $T$, except that these endpoints are
themselves included only if they are in $T$. Then we can define $y_t$ for
$t \in I - T$, in such a way that the resulting process has orthogonal
increments. To do this suppose that $t \in I - T$ and that $t$ is a limit point of
$T$ from the right. Then define $y_t = y_{t+}$. The remaining points of $I - T$
are in disjunct semi-open or open intervals $[c, d)$ or $(c, d)$ with $d$ either in
$T$ or not in $T$ but a limit point of $T$ from the right. Define $y_t = y_d$ in
each such interval to complete the definition of the extended $y_t$ process.

Throughout the rest of this chapter the parameter range T will be an 
interval. The result just obtained shows how little a restriction this is. 


2. Stochastic integrals 


Let $\{y_t, t \in T\}$ be a process with orthogonal increments. In the
following, $T$ will always be a finite or infinite interval. Let $\Phi$ be a fixed
$t$ function (that is, one not depending on $\omega$). We shall define

$$\varphi = \int_A \Phi\, dy_t$$

for a large class of functions $\Phi$ and sets $A \subset T$. The integral $\varphi$ will be
a random variable. Since the sample functions of the $y_t$ process are not
of bounded variation except in special cases, $\varphi$ cannot be defined as an
ordinary Stieltjes integral in the individual sample functions. The
integral in question is, however, a generalized Stieltjes integral, and in an
attempt to make it look more like one we shall adopt the notation $y(t)$
instead of $y_t$, writing the integral

$$\varphi = \int_A \Phi(t)\, dy(t).$$

In the following we shall assume that the parameter range $T$ is the infinite
line $(-\infty, \infty)$. The modification necessary for $T$ a different interval will
be obvious. We first define $\varphi$ when $\Phi$ is a step function of a special type
and $A = T$. If $a_1 < \cdots < a_n$, and if

$$\Phi(t) = \begin{cases} 0, & t < a_1, \\ c_j, & a_{j-1} \le t < a_j, \quad 1 < j \le n, \\ 0, & t \ge a_n, \end{cases}$$




then we define

(2.1)   $\varphi = \int_T \Phi(t)\, dy(t) = \sum_{j=2}^{n} c_j\, [y(a_j -) - y(a_{j-1} -)].$

In fact, we shall accept as $\varphi$ any random variable equal almost everywhere
to the sum on the right. The integral will, however, always be understood
to be one particular random variable, not a class of equivalent ones.

As defined by (2.1) the integral is determined uniquely by $\Phi$, neglecting
values on sets of zero probability, linear combinations of $\Phi$'s correspond
to the same linear combinations of the corresponding $\varphi$'s, and

(2.2)   $E\Bigl\{\int_T \Phi(t)\, dy(t)\; \overline{\int_T \Psi(t)\, dy(t)}\Bigr\} = \int_T \Phi(t)\, \overline{\Psi(t)}\, dF(t).$


Equality (2.2) can be interpreted as follows: we have set up a
correspondence between certain functions of $t$ (step functions $\Phi$) and random
variables $\varphi$. If distance between $\Phi$'s and distance between $\varphi$'s is defined by

(2.3)   $\|\Phi_1 - \Phi_2\| = \Bigl[\int_T |\Phi_1(t) - \Phi_2(t)|^2\, dF(t)\Bigr]^{1/2}, \qquad \|\varphi_1 - \varphi_2\| = [E\{|\varphi_1 - \varphi_2|^2\}]^{1/2},$

equation (2.2) implies that distance is preserved by the correspondence.
Now suppose that $\Phi$ is a limit (in the sense of the above distance) of a
sequence $\{\Phi_n\}$ of step functions of the above type. Then

$$\|\Phi - \Phi_n\|^2 = \int_T |\Phi(t) - \Phi_n(t)|^2\, dF(t) \to 0 \qquad (n \to \infty),$$

that is,

$$\operatorname*{l.i.m.}_{n \to \infty} \Phi_n = \Phi$$

[where in the l.i.m. we are using the weighting $dF(t)$]. It follows, since
distance is preserved, that $\operatorname*{l.i.m.}_{n \to \infty} \varphi_n$ also exists, defining a random variable
$\varphi$. This random variable, as a limit in the mean, is defined uniquely,
neglecting values on an $\omega$ set of probability 0. Aside from this intrinsic
lack of uniqueness, $\varphi$ is independent of the particular sequence $\{\Phi_n\}$
chosen, since two sequences converging in the mean to $\Phi$ can be combined
to form a single sequence converging in the mean to $\Phi$, whose
corresponding sequence of random variables converges in the mean. We define

$$\int_T \Phi(t)\, dy(t)$$

as the limit $\varphi$ obtained in this way, or rather as any random variable
equal with probability 1 to a limit obtained in this way. The class of




functions $\Phi$ for which the integral is defined is the class of functions of $t$
which are measurable with respect to the Lebesgue-Stieltjes $dF$ measure

$$\int_A dF(t)$$

and for which

$$\int_T |\Phi(t)|^2\, dF(t) < \infty.$$

(This definition is simply an application of the principle that a uniformly
continuous function defined on a point set of a metric space, and taking
on values in a complete metric space, can be defined on the closure of its
domain of definition by continuity.) Equation (2.2) is true in the general
case since it is true for step function integrands. Finally, if $A$ is a Borel
$t$ set or more generally is any $t$ set measurable with respect to the $dF$
measure, and if $\Phi_A(t)$ is defined as $\Phi(t)$ on $A$ and 0 otherwise, we define

$$\int_A \Phi(t)\, dy(t) = \int_T \Phi_A(t)\, dy(t)$$

if the integral on the right exists. With this definition (2.2) is true for $T$
replaced by $A$. The stochastic integral is obviously linear and
homogeneous in the integrand and additive in the domain of integration,
neglecting values of the integral on $\omega$ sets of probability 0.
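
The distance-preserving property (2.2) can be illustrated by simulation for step function integrands. The sketch below rests on assumptions that are not in the text: the $y(t)$ process is taken to have independent Gaussian increments with $F(t) = t$ (a Brownian motion, which is one particular process with orthogonal increments), the integrands are real, and all function names and numerical values are illustrative.

```python
import random

def step_integral(values, increments):
    """Sum of c_j times the increment of y over the j-th cell of the partition,
    which is the stochastic integral of the step function with values c_j."""
    return sum(c * dy for c, dy in zip(values, increments))

def simulate_isometry(phi, psi, knots, n_paths=40000, seed=1):
    """Compare a Monte Carlo estimate of E{phi_integral * psi_integral}
    with the integral of Phi * Psi dF, taking F(t) = t."""
    rng = random.Random(seed)
    widths = [b - a for a, b in zip(knots, knots[1:])]
    acc = 0.0
    for _ in range(n_paths):
        # Independent Gaussian increments with E|dy|^2 = dt (an assumption).
        incs = [rng.gauss(0.0, w ** 0.5) for w in widths]
        acc += step_integral(phi, incs) * step_integral(psi, incs)
    monte_carlo = acc / n_paths
    exact = sum(p * q * w for p, q, w in zip(phi, psi, widths))
    return monte_carlo, exact
```

For example, with `knots = [0.0, 1.0, 3.0, 4.0]`, `phi = [2.0, -1.0, 0.5]` and `psi = [1.0, 3.0, -2.0]`, the right side of (2.2) is $2 \cdot 1 - 3 \cdot 2 - 1 \cdot 1 = -5$, and the simulated expectation should lie near that value.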
In many applications,

$$E\{y(t) - y(s)\} = 0.$$

If this is true, then

$$E\Bigl\{\int_A \Phi(t)\, dy(t)\Bigr\} = 0$$

because this equation will obviously be true for the step function integrands
and $A = T$, and expectations are preserved under limits in the mean.
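Note that, if $T = (-\infty, \infty)$,

(2.4)   $\operatorname*{l.i.m.}_{a \to -\infty,\; b \to \infty} \int_{(a, b)} \Phi(t)\, dy(t) = \int_{-\infty}^{\infty} \Phi(t)\, dy(t),$

since, if the integral on the right exists,

(2.5)   $E\Bigl\{\Bigl|\int_{-\infty}^{\infty} \Phi(t)\, dy(t) - \int_{(a, b)} \Phi(t)\, dy(t)\Bigr|^2\Bigr\} = \int_{\{t \notin (a, b)\}} |\Phi(t)|^2\, dF(t) \to 0 \qquad (a \to -\infty,\; b \to +\infty).$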




It is easy to show that, if $\Phi$ is continuous in the finite interval $[a, b]$,
or even merely Riemann-Stieltjes integrable with respect to $F$, then

(2.6)   $\operatorname*{l.i.m.}_{\delta \to 0} \sum_{j=0}^{n-1} \Phi(t_j')\, [y(t_{j+1}) - y(t_j)] = \int_{a-}^{b+} \Phi(t)\, dy(t),$

where

$$a = t_0 < \cdots < t_n = b, \qquad t_j \le t_j' \le t_{j+1}, \qquad \delta = \max_j\, (t_{j+1} - t_j),$$

and the $t_j$'s are points of continuity of $F$. Here we interpret the integral
as over the closed interval $[a, b]$, and if $F$ is not continuous at $a$, $b$ we
replace $y(t_0)$ by $y(t_0 -)$ and $y(t_n)$ by $y(t_n +)$ on the left. To prove this
limit equation let $\Psi(t)$ be defined by

$$\Psi(t) = \begin{cases} \Phi(t) - \Phi(t_j'), & t \in [t_j, t_{j+1}), \\ \Phi(t) - \Phi(t_{n-1}'), & t = t_n. \end{cases}$$

Then the absolute value squared of the difference between sum and
integral in (2.6) has expectation

$$E\Bigl\{\Bigl|\sum_{j=0}^{n-1} \int_{t_j}^{t_{j+1}} [\Phi(t_j') - \Phi(t)]\, dy(t)\Bigr|^2\Bigr\} = \int_{a-}^{b+} |\Psi(t)|^2\, dF(t).$$

The function $\Psi$ is bounded, independently of the $t_j$'s and $t_j'$'s, and goes
to 0 with $\delta$ for almost all $t$ ($dF$ measure) according to the hypotheses
imposed on $\Phi$. Then the right side of this equality goes to 0 with $\delta$, as
was to be proved. If $t_j$'s are allowed to be at discontinuities of $F$, the
approximating sums will have to correspond to semi-closed intervals, but
the proof needs no change.
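
The rate at which the mean square error in (2.6) goes to 0 can be made explicit in a simple case. The sketch below assumes $\Phi(t) = t$, $F(t) = t$, left endpoints $t_j' = t_j$, and a uniform partition; these choices and the function name are illustrative and are not taken from the text. Under them $\Psi(t) = t - t_j$ on each cell, so the error integral can be evaluated exactly with rational arithmetic.

```python
from fractions import Fraction

def mean_square_error(a, b, n):
    """Integral of |Psi(t)|^2 dF(t) over [a, b] for Phi(t) = t, F(t) = t and
    t_j' = t_j on a uniform partition of [a, b] into n cells (a, b integers)."""
    h = Fraction(b - a, n)  # cell width, equal to delta for this partition
    # On [t_j, t_j + h) we have Psi(t) = t - t_j, and the integral of
    # (t - t_j)^2 dt over one cell is h^3 / 3.
    return sum(h ** 3 / 3 for _ in range(n))
```

For $[a, b] = [0, 1]$ this gives $1/(3n^2)$, so refining the partition tenfold reduces the mean square error a hundredfold, in line with the statement that the error goes to 0 with $\delta$.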

In the following we shall need a criterion for convergence. Since

$$\int_A |\Phi(t) - \Phi_n(t)|^2\, dF(t) = E\Bigl\{\Bigl|\int_A \Phi(t)\, dy(t) - \int_A \Phi_n(t)\, dy(t)\Bigr|^2\Bigr\},$$

the sequence of stochastic integrals $\bigl\{\int_A \Phi_n(t)\, dy(t)\bigr\}$ converges in the mean
to $\int_A \Phi(t)\, dy(t)$ if and only if the integrands converge in the mean to $\Phi(t)$
[weighting $dF(t)$ on $A$].
Suppose now that the process $\{y(t), t \in T\}$ has orthogonal increments,
with $E\{|dy|^2\} = dF$, where $T$ is an interval as usual. We shall have
occasion to use double integrals of the form

(2.7)   $\iint_A \Phi(s, t)\, ds\, dy(t),$

where $A$ is a two-dimensional Borel set, or more generally is measurable
with respect to the two-dimensional Lebesgue-Stieltjes $ds\, dF$ measure

$$\iint_A ds\, dF(t).$$

This integral will be defined in terms of the iterated integrals. The
following theorem will be needed.

THEOREM 2.1 Let $\{y(t), t \in T\}$ be a process with orthogonal increments,
with $E\{|dy|^2\} = dF$, let $\Phi(\cdot, \cdot)$ be measurable with respect to the
Lebesgue-Stieltjes $ds\, dF$ measure, and suppose that

$$\int_T |\Phi(s, t)|^2\, dF(t) < \infty$$

for almost all $s$ (Lebesgue measure). Then the stochastic integral

$$z(s) = \int_T \Phi(s, t)\, dy(t)$$

can be defined in such a way that the $z(s)$ process is measurable.

The problem is to take advantage of the lack of uniqueness of the
stochastic integral to define $z(s)$ to get a measurable $s$, $\omega$ function. To
prove the theorem suppose first that $\Phi(s, t)$ is a finite sum of the form

$$\Phi(s, t) = \sum_j \Phi_{1j}(s)\, \Phi_{2j}(t)$$

with the first factors Lebesgue measurable, the second factors measurable
with respect to $dF$ measure, and

$$\int_T |\Phi_{2j}(t)|^2\, dF(t) < \infty.$$

Then the evaluation

$$\int_T \Phi(s, t)\, dy(t) = \sum_j \Phi_{1j}(s) \int_T \Phi_{2j}(t)\, dy(t)$$

shows that the stochastic integral on the left can be expressed as the sum
of products of functions measurable in each variable and therefore
measurable in the pair. The general case is then treated by the usual
approximation procedure.

We proceed to the definition of the double integral (2.7). It is sufficient
to consider only the case when $A$ is the infinite strip $-\infty < s < \infty$, $t \in T$,
since the integral can be defined for other sets $A$ by setting the integrand
equal to 0 in the complement of $A$. Suppose then that $\Phi$ is a function
defined on the infinite strip, measurable with respect to $ds\, dF(t)$ measure,
and that either

(2.8′)   $\int_T dF(t) \Bigl[\int_{-\infty}^{\infty} |\Phi(s, t)|\, ds\Bigr]^2 < \infty$

or

(2.8″)   $\int_{-\infty}^{\infty} ds \Bigl[\int_T |\Phi(s, t)|^2\, dF(t)\Bigr]^{1/2} < \infty.$

If (2.8′) is true, the iterated integral

(2.9′)   $z' = \int_T dy(t) \int_{-\infty}^{\infty} \Phi(s, t)\, ds$

is well defined, with $E\{|z'|^2\}$ dominated by the left side of (2.8′). If
(2.8″) is true, the iterated integral

(2.9″)   $z'' = \int_{-\infty}^{\infty} ds \int_T \Phi(s, t)\, dy(t)$

is well defined, if the result of the first integration is chosen to be $s$, $\omega$
measurable, as it can be according to Theorem 2.1, and $E\{|z''|\}$ is
dominated by the left side of (2.8″). Moreover,


(2.10)   $P\{z'(\omega) = z''(\omega)\} = 1$

if both (2.8′) and (2.8″) are true, that is, the order of integration is
immaterial. To show this suppose that $\Phi$ is given by

$$\Phi(s, t) = \sum_j \Phi_{1j}(s)\, \Phi_{2j}(t),$$

where $\Phi_{1j}$ is Lebesgue measurable, $\Phi_{2j}$ is measurable with respect to $dF$
measure, and

$$\int_{-\infty}^{\infty} |\Phi_{1j}(s)|\, ds < \infty, \qquad \int_T |\Phi_{2j}(t)|^2\, dF(t) < \infty.$$

Then (2.8′) and (2.8″) are both true, and it is trivial to verify that (2.10)
is true. The proof in the general case that when (2.8′) and (2.8″) are both
true (2.10) is also necessarily true can be effected by the usual
approximation procedure. The double integral (2.7) with $A$ the infinite strip
$-\infty < s < \infty$, $t \in T$ is now defined when either (2.8′) or (2.8″) is true




as an iterated integral. This integral, according to our definition, is any 
one of a number of random variables, any two of which are equal with 
probability 1, and containing any random variable equal to one of them 
with probability 1. 

As an application of this double integral consider the following special
case. Let $h$ be an absolutely continuous $t$ function in the interval
$[a, b] \subset T$. Define

$$\Phi(s, t) = \begin{cases} h'(s), & s \le t, \\ 0, & s > t. \end{cases}$$

Then, evaluating the double integral by iterated integration in both orders,

(2.11)   $\int_a^b \int_a^b \Phi(s, t)\, ds\, dy(t) = \int_a^b [h(t) - h(a)]\, dy(t) = \int_a^b [y(b) - y(s)]\, h'(s)\, ds = [h(b) - h(a)][y(b) - y(a)] - \int_a^b [y(t) - y(a)]\, h'(t)\, dt$

with probability 1, if $F$ is continuous at $a$ and $b$. We have thus obtained
the formula for integration by parts. If $F$ need not be continuous at $a$
and $b$, the formula is modified slightly, the modification depending on
whether or not $a$ and $b$ are included in the domain of integration.
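
The integration by parts formula has an exact discrete counterpart, Abel's summation by parts, which may help in checking the signs and boundary terms. The sketch below is an illustration only: it replaces the integrals by sums over a partition $t_0 < \cdots < t_n$, uses a simulated random walk as a stand-in for a sample function of $y$, and the function names are hypothetical.

```python
import random

def both_sides(h_vals, y_vals):
    """Discrete analogue of integration by parts on a partition t_0 < ... < t_n.

    lhs: sum of [h(t_j) - h(t_0)] * [y(t_{j+1}) - y(t_j)]
    rhs: [h(t_n) - h(t_0)] * [y(t_n) - y(t_0)]
         - sum of [y(t_{j+1}) - y(t_0)] * [h(t_{j+1}) - h(t_j)]
    The two agree exactly by Abel's summation by parts."""
    n = len(h_vals) - 1
    lhs = sum((h_vals[j] - h_vals[0]) * (y_vals[j + 1] - y_vals[j])
              for j in range(n))
    boundary = (h_vals[n] - h_vals[0]) * (y_vals[n] - y_vals[0])
    rhs = boundary - sum((y_vals[j + 1] - y_vals[0]) * (h_vals[j + 1] - h_vals[j])
                         for j in range(n))
    return lhs, rhs

def random_walk_path(n, seed=2):
    """Values y(t_0), ..., y(t_n) of a Gaussian random walk on [0, 1]."""
    rng = random.Random(seed)
    path = [0.0]
    for _ in range(n):
        path.append(path[-1] + rng.gauss(0.0, (1.0 / n) ** 0.5))
    return path
```

Taking, say, `h_vals = [(j / 200) ** 2 for j in range(201)]` and `y_vals = random_walk_path(200)`, the two returned numbers coincide up to rounding error, whatever the path.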

In many applications the $y(t)$ process has uncorrelated rather than
orthogonal increments. If this is true, the process with variables
$\{y(t) - m(t)\}$ has orthogonal increments, if we define $m(t)$ by

$$m(t) = E\{y(t) - y(0)\}.$$

We then extend the definition of the stochastic integral as follows:

(2.12)   $\int_A \Phi(t)\, dy(t) = \int_A \Phi(t)\, d[y(t) - m(t)] + \int_A \Phi(t)\, dm(t),$

where it is supposed that $m$ is so regular that the last integral can be
defined in the usual way. With this definition, if

$$E\{|[y(t) - m(t)] - [y(s) - m(s)]|^2\} = F(t) - F(s), \qquad s < t,$$

then

(2.13)   $E\Bigl\{\int_A \Phi(t)\, dy(t)\Bigr\} = \int_A \Phi(t)\, dm(t),$

$$E\Bigl\{\int_A \Phi(t)\, dy(t)\; \overline{\int_A \Psi(t)\, dy(t)}\Bigr\} = \int_A \Phi(t)\, \overline{\Psi(t)}\, dF(t) + \int_A \Phi(t)\, dm(t)\; \overline{\int_A \Psi(t)\, dm(t)}.$$




It will be useful to have the variance of the stochastic integral,

(2.14)   $E\Bigl\{\Bigl|\int_A \Phi(t)\, dy(t) - \int_A \Phi(t)\, dm(t)\Bigr|^2\Bigr\} = \int_A |\Phi(t)|^2\, dF(t).$

This stochastic integral was first used by Wiener [who supposed the
$y(t)$ process to be the Brownian movement process]. We shall see the
theoretical applications to stationary processes in later chapters. Two
simple practical applications will be given here.


3. Application to Campbell’s theorem 


Suppose that events are occurring in accordance with a Poisson process,
at a rate $c > 0$. Each event has a certain intensity $u$, and has an effect
$u\Phi(t)$ after $t$ time units have passed. Let $\theta(t)$ be the sum of the effects
of all events occurring prior to time $t$,

(3.1)   $\theta(t) = \sum_j \Phi(t - t_j)\, u_j,$

where $t_1, t_2, \cdots$ are the times of the events occurring before time $t$ and
$u_1, u_2, \cdots$ are the intensities at these times. It is supposed that
$u_1, u_2, \cdots$ are mutually independent random variables with a common
distribution function. Let $\{y(t), -\infty < t < \infty\}$ be a stochastic process
whose sample functions are constant between the events, and increase by
the corresponding intensity at each event. Then the $y(t)$ process has
stationary independent increments, and

(3.2)   $E\{y(s + t) - y(s)\} = c\alpha t, \qquad E\{[y(s + t) - y(s) - c\alpha t]^2\} = c\beta t,$

where

(3.3)   $E\{u\} = \alpha, \qquad E\{u^2\} = \beta.$

Finally, we can now write $\theta(t)$ in the form

(3.4)   $\theta(t) = \int_{-\infty}^{t} \Phi(t - s)\, dy(s) = \int_{-\infty}^{\infty} \Phi(t - s)\, dy(s)$

if we set $\Phi(t) = 0$ for $t < 0$. Thus, $\theta(t)$ defines a strictly stationary
stochastic process, a process of moving averages. With the notation

(3.5)   $\int_0^{\infty} \Phi(t)\, dt = a, \qquad \int_0^{\infty} \Phi(t)^2\, dt = b,$




we find

(3.6)   $E\{\theta(t)\} = c\alpha a, \qquad E\{[\theta(t) - c\alpha a]^2\} = c\beta \int_0^{\infty} \Phi(t)^2\, dt = c\beta b.$

In particular, if the $u$ distribution is concentrated at a single value,
$\beta = \alpha^2$. In this case the evaluation of the variance of $\theta(t)$ just obtained
is known as Campbell's theorem.
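
The mean and variance in (3.6) can be checked by simulating the shot effect directly. The following sketch makes several assumptions that are not in the text: the effect function is $\Phi(t) = e^{-t}$ (so that $a = 1$ and $b = 1/2$), every intensity $u_j$ equals 1 (so $\alpha = \beta = 1$), and the infinite past is truncated to a finite window, which introduces a negligible error for this rapidly decaying $\Phi$. Function names and parameter values are illustrative.

```python
import math
import random

def shot_noise_sample(rate, kernel, window, rng):
    """One sample of theta(0) = sum_j kernel(0 - t_j) over the Poisson events
    t_j in (-window, 0), every intensity u_j being equal to 1."""
    total = 0.0
    age = rng.expovariate(rate)  # time elapsed since the most recent event
    while age < window:
        total += kernel(age)
        age += rng.expovariate(rate)  # step back to the previous event
    return total

def campbell_check(rate=5.0, window=30.0, n_samples=4000, seed=3):
    """Sample mean and variance of theta(0) for Phi(t) = exp(-t)."""
    rng = random.Random(seed)
    kernel = lambda t: math.exp(-t)  # assumed effect function: a = 1, b = 1/2
    xs = [shot_noise_sample(rate, kernel, window, rng) for _ in range(n_samples)]
    mean = sum(xs) / n_samples
    var = sum((x - mean) ** 2 for x in xs) / (n_samples - 1)
    return mean, var
```

With the default rate $c = 5$, formula (3.6) predicts a mean near $c\alpha a = 5$ and a variance near $c\beta b = 2.5$; the simulated values should agree with these up to sampling error.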

According to Theorem 2.1, $\theta(t)$ can be defined in such a way, for each
$t$, that the $\theta(t)$ process is measurable. It is then possible to identify the
expectations in (3.6) with time averages, as follows. The $\theta(t)$ process is
strictly stationary and metrically transitive (see XI §1), and it then follows
from the strong law of large numbers for strictly stationary processes
(ergodic theorem, XI, Theorem 2.1) that

(3.6′)   $\lim_{s \to \infty} \frac{1}{s} \int_0^s \theta(t)\, dt = c\alpha a, \qquad \lim_{s \to \infty} \frac{1}{s} \int_0^s [\theta(t) - c\alpha a]^2\, dt = c\beta b$

with probability 1.

In most applications the rate $c$ at which events take place is very large,
and if this is so it is easily proved, by characteristic functions for example,
that the $y(t)$ increments are nearly Gaussian, and that the $\theta(t)$ process is
nearly Gaussian. This means that $\{y(t) - c\alpha t, -\infty < t < \infty\}$ is very
nearly a Brownian movement process (see VIII §2). To make this more
concrete let $\{y_1(t), -\infty < t < \infty\}$ be a process with independent Gaussian
increments satisfying (3.2), so that the process $\{y_1(t) - c\alpha t, -\infty < t < \infty\}$
is a Brownian movement process. Define $\theta_1(t)$ by

(3.4′)   $\theta_1(t) = \int_{-\infty}^{\infty} \Phi(t - s)\, dy_1(s).$

Then $\theta_1(t)$ defines a strictly stationary Gaussian process to which the
$\theta(t)$ process is asymptotic for large $c$, and it is the $\theta_1$ rather than the $\theta$
process which is usually treated in the applications.


4. Fourier transform of a process with orthogonal increments 


We shall need the Fourier transform of a process $\{y(t), -\infty < t < \infty\}$
with orthogonal increments, satisfying

(4.1)   $E\{|dy(t)|^2\} = \sigma^2\, dt, \qquad \sigma > 0.$




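We wish to find a second process $\{y^*(t), -\infty < t < \infty\}$, also with
orthogonal increments and satisfying (4.1), for which (formally)

(4.2)   $y'(t) = \int_{-\infty}^{\infty} e^{2\pi i t s}\, y^{*\prime}(s)\, ds, \qquad y^{*\prime}(t) = \int_{-\infty}^{\infty} e^{-2\pi i t s}\, y'(s)\, ds.$

These equations are of course meaningless as they stand, since $y'(t)$ does
not necessarily exist. We interpret them by formal integration between
$\lambda$ and $\mu$,

(4.3)   $y(\mu) - y(\lambda) = \int_{-\infty}^{\infty} \frac{e^{2\pi i s \mu} - e^{2\pi i s \lambda}}{2\pi i s}\, dy^*(s), \qquad y^*(\mu) - y^*(\lambda) = \int_{-\infty}^{\infty} \frac{e^{-2\pi i s \mu} - e^{-2\pi i s \lambda}}{-2\pi i s}\, dy(s).$

We shall show that the second equation in (4.3) defines a $y^*$ process with
orthogonal increments satisfying (4.1) and the first equation of (4.3).
We shall use the notation $\Phi_{\lambda, \mu}$ for the function which is 1 between $\lambda$ and
$\mu$ and 0 otherwise; $\Phi_{\lambda, \mu}^*$ will denote its Fourier transform, the second
integrand in (4.3). Then, from the second equation in (4.3),

(4.4)   $E\{[y^*(\mu_1) - y^*(\lambda_1)]\, \overline{[y^*(\mu_2) - y^*(\lambda_2)]}\} = \sigma^2 \int_{-\infty}^{\infty} \Phi_{\lambda_1, \mu_1}^*(s)\, \overline{\Phi_{\lambda_2, \mu_2}^*(s)}\, ds = \sigma^2 \int_{-\infty}^{\infty} \Phi_{\lambda_1, \mu_1}(s)\, \Phi_{\lambda_2, \mu_2}(s)\, ds,$

using the Parseval identity for Fourier transforms. In particular, if
$\lambda_1 < \mu_1 < \lambda_2 < \mu_2$, the last integral vanishes. Hence the $y^*$ process
[defined by the second equation in (4.3) with $\lambda = 0$ say] has orthogonal
increments. If $\lambda_1 = \lambda_2 < \mu_1 = \mu_2$ the last integral in (4.4) becomes
$\mu_1 - \lambda_1$. Thus (4.1) is satisfied by the $y^*$ process. In order to prove the
first equation of (4.3) we prove the more general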


(4.5) [Owo = [FO ay"). 




Here f and f* are Fourier transforms of each other, 


fi) = | ent f(s) ds 
(4.6) ma 
"O= O) ds, 


and (4.5) will be proved for every Lebesgue-measurable $f$ whose square is
integrable, that is, for every Lebesgue-measurable $f^*$ whose square is
integrable. For such functions the correspondence between $f$ and $f^*$ is
given by the Fourier-Plancherel theorem and in (4.6) the integrals must
be interpreted as certain limits in the mean, in accordance with this
theorem, so that equations (4.6) are only true for almost all $t$. It is
sufficient to prove (4.5) in order to prove the first equation in (4.3),
because, if $f = \Phi_{\lambda, \mu}$, (4.5) reduces to the first equation in (4.3). We
already know that (4.5) holds for $f^* = \Phi_{\lambda, \mu}$; this is the second equation
in (4.3). The class of functions $f^*$ for which (4.5) holds is a closed linear
manifold of functions, defining the distance between $f_1^*$ and $f_2^*$ as

$$\Bigl[\int_{-\infty}^{\infty} |f_1^*(t) - f_2^*(t)|^2\, dt\Bigr]^{1/2}.$$

Since this manifold contains every $\Phi_{\lambda, \mu}$, it contains every $f^*$ of the stated
class.


5. A generalization of the stochastic integral of §2 


Let $\{y(t), t \in T\}$ be a stochastic process and let $\Phi$ be a function of
$t \in T$ and $\omega$. Stochastic integrals

$$\varphi = \int_A \Phi(t, \omega)\, dy(t)$$

are used in many places in this book. In each case the $y(t)$ process is of
some special type, the functions $\Phi$ are restricted to some linear class
which depends on the given $y(t)$ process, and $A$ is restricted to some
specified class of $t$ sets. The general principle involved in the definition
of the stochastic integral is the following. It is always supposed that $T$
is an interval, which may be infinite. Under the hypothesis that $A = T$
and that $\Phi$ is a step function, the stochastic integral is defined as the
obvious Riemann-Stieltjes sum. Keeping $A = T$, the stochastic integral
is then defined for the general $\Phi$ by a limit procedure. Finally the
integral over $A$ is defined by

$$\int_A \Phi(t, \omega)\, dy(t) = \int_T \Phi_A(t, \omega)\, dy(t),$$
A v 




where

$$\Phi_A(t, \omega) = \begin{cases} \Phi(t, \omega), & t \in A, \\ 0, & t \notin A. \end{cases}$$

This procedure has already been used in §2. In the present section a
related case will be treated. The definition will be given for
$A = T = (-\infty, \infty)$. The extension to other cases will be obvious. The
following hypotheses are made.

I₁. The process $\{y(t), \mathcal{F}_t, -\infty < t < \infty\}$ is a martingale. (See VII
§1.) There is a monotone non-decreasing function $F$ such that, if $s < t$,

$$E\{|y(t) - y(s)|^2\} = E\{|y(t) - y(s)|^2 \mid \mathcal{F}_s\} = F(t) - F(s)$$

with probability 1.

In particular, if $F(t) = \text{const.}\ t$, if the martingale is real, and if almost
all its sample functions are continuous, the $y(t)$ process is a Brownian
motion process, according to VII, Theorem 11.9, and this is the most
important special case.

I₂. $\Phi$ is a $(t, \omega)$ function measurable with respect to $dt\, dP$ measure.
For each $s$, $\Phi(s, \cdot)$ is $\omega$ measurable with respect to the field $\mathcal{F}_s$. Finally

$$\int_{-\infty}^{\infty} E\{|\Phi(t, \omega)|^2\}\, dF(t) < \infty.$$

The stochastic integral

(5.1)   $\int_{-\infty}^{\infty} \Phi(t, \omega)\, dy(t)$

will be defined in such a way that, if the integrands $\Phi$ and $\Psi$ correspond
to the integrals $\varphi$ and $\psi$,

(5.2)   $E\{\varphi\} = E\Bigl\{\int_{-\infty}^{\infty} \Phi(t, \omega)\, dy(t)\Bigr\} = 0, \qquad E\{\varphi \bar{\psi}\} = \int_{-\infty}^{\infty} E\{\Phi(t, \omega)\, \overline{\Psi(t, \omega)}\}\, dF(t).$

These equations generalize (2.13), to which they reduce if $\Phi$ and $\Psi$ are
functions of $t$ alone. The $y(t)$ process has orthogonal increments because
of I₁. The stochastic integral of this section is more general than that
discussed in §2 in that the integrand may depend on $\omega$ as well as on $t$,
but less general in that the $y(t)$ process is a martingale, instead of merely
having orthogonal increments.




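We shall use the following fact in the discussion below without further
comment. If $\Phi(\omega)$ is measurable with respect to the field $\mathcal{F}_{t_0}$, and if
$E\{|\Phi(\omega)|^2\} < \infty$, then (assuming I₁), if $t_0 < t_1 < t_2$,

(5.3)   $E\{|\Phi(\omega)|^2\, |y(t_2) - y(t_1)|^2\} < \infty$

and

(5.4)   $E\{\Phi(\omega)[y(t_2) - y(t_1)]\} = 0, \qquad E\{|\Phi(\omega)|^2\, |y(t_2) - y(t_1)|^2\} = E\{|\Phi(\omega)|^2\}[F(t_2) - F(t_1)].$

To prove (5.3) and (5.4) we remark that, if (5.3) is true, then

$$E\{|\Phi(\omega)|\, |y(t_2) - y(t_1)|\} < \infty,$$

by Schwarz's inequality, and hence

$$E\{\Phi(\omega)[y(t_2) - y(t_1)]\} = E\{\Phi(\omega)\, E\{y(t_2) - y(t_1) \mid \mathcal{F}_{t_1}\}\} = E\{\Phi(\omega) \cdot 0\} = 0.$$

Then the first equation of (5.4) is true, and the second is proved in the
same way. Thus we have only to prove that, if $E\{|\Phi(\omega)|^2\} < \infty$, (5.3)
must be true. To prove this define

$$\Phi_n(\omega) = \begin{cases} \Phi(\omega), & |\Phi(\omega)| \le n, \\ 0, & |\Phi(\omega)| > n. \end{cases}$$

Then (5.3) is true with $\Phi$ replaced by $\Phi_n$, so that from what we have
already proved

$$E\{|\Phi_n(\omega)|^2\, |y(t_2) - y(t_1)|^2\} = E\{|\Phi_n(\omega)|^2\}[F(t_2) - F(t_1)].$$

When $n \to \infty$ this equation implies (5.3).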
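We now define the stochastic integral (5.1). Assume first that $\Phi$ is a
function of the form

$$\Phi(t, \omega) = \begin{cases} 0, & t < a_1, \\ \Phi_j(\omega), & a_j \le t < a_{j+1}, \quad j \le n - 1, \\ 0, & t \ge a_n, \end{cases}$$

where $a_1 < \cdots < a_n$, $\Phi_j$ is measurable with respect to $\mathcal{F}_{a_j}$, and
$E\{|\Phi_j(\omega)|^2\} < \infty$. Any such function will be called a $(t, \omega)$ step function.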




In this case we define 
p = J D0, 0) dyle) = 3 Poa.) = 4M 
“o 

and in fact we accept as the integral any random variable equal almost
everywhere to the sum on the right. With this definition $\varphi$ is determined
uniquely by $\Phi$, neglecting values on $\omega$ sets of measure 0, and (5.2) is true.
We have thus set up a correspondence between $(t, \omega)$ step functions $\Phi$
and certain random variables $\varphi$ in which linear combinations of $\Phi$'s
yield the corresponding linear combinations of $\varphi$'s. If distances between
pairs of functions $\Phi$ satisfying I₂ and between pairs of random variables
$\varphi$, with $E\{|\varphi|^2\} < \infty$, are defined respectively by

$$\|\Phi_1 - \Phi_2\| = \Bigl[\int_{-\infty}^{\infty} E\{|\Phi_1(t, \omega) - \Phi_2(t, \omega)|^2\}\, dF(t)\Bigr]^{1/2}, \qquad \|\varphi_1 - \varphi_2\| = [E\{|\varphi_1 - \varphi_2|^2\}]^{1/2},$$

this correspondence is distance preserving because (5.2) is satisfied.
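
The two relations in (5.2) can be illustrated numerically for a $(t, \omega)$ step function whose values depend on the past of the process. The sketch below assumes the most important special case named above, a Brownian motion with $F(t) = t$, on the interval $[0, 1]$ with $a_j = j/n$, and takes $\Phi_j = y(a_j)$; the function names and parameter values are illustrative. Under these assumptions $E\{|\Phi_j|^2\} = a_j$, so the second relation in (5.2) predicts $E\{\varphi^2\} = \sum_j a_j (a_{j+1} - a_j)$.

```python
import random

def ito_step_integral(n_steps, rng):
    """Sum over j of Phi_j times the increment of y over [a_j, a_{j+1}),
    with a_j = j / n_steps and Phi_j = y(a_j), which depends only on the
    path up to time a_j (y is a simulated Brownian motion, an assumption)."""
    h = 1.0 / n_steps
    y = 0.0
    phi = 0.0
    for _ in range(n_steps):
        dy = rng.gauss(0.0, h ** 0.5)
        phi += y * dy  # integrand evaluated at the left endpoint a_j
        y += dy
    return phi

def moments(n_steps=10, n_paths=40000, seed=4):
    """Sample first and second moments of the integral, and the value of
    the second moment predicted by (5.2)."""
    rng = random.Random(seed)
    samples = [ito_step_integral(n_steps, rng) for _ in range(n_paths)]
    first = sum(samples) / n_paths
    second = sum(s * s for s in samples) / n_paths
    exact_second = sum((j / n_steps) * (1.0 / n_steps) for j in range(n_steps))
    return first, second, exact_second
```

With ten steps the predicted second moment is $0.45$; the simulated first moment should be near 0 and the simulated second moment near $0.45$, up to sampling error.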

Then, just as in §2, if $\Phi$ is a limit (in the sense of $\Phi$ distance) of a
sequence of $(t, \omega)$ step functions, we define the stochastic integral (5.1)
as the limit (in the sense of $\varphi$ distance, that is, limit in the mean) of the
corresponding integrals. With this definition (5.2) will clearly be satisfied
in all cases. There remains the proof that the closure of the class of
$(t, \omega)$ step functions includes all functions satisfying I₂; that is, the proof
that, if $M$ is the class of functions satisfying I₂ which can be approximated
arbitrarily closely (in terms of $\Phi$ distance) by $(t, \omega)$ step functions, $M$
includes all functions satisfying I₂. We proceed to this proof. We can
write $F(t)$ in the form

$$F(t) = F_1(t) + F_2(t),$$

where $F_1$ is a monotone non-decreasing function increasing only in jumps,
at $\tau_1, \tau_2, \cdots$, the discontinuities of $F$, and $F_2$ is monotone non-decreasing
and continuous. We shall suppose in the following that $F_2(t) = t$. Even
if this is not so, the change of variable $t' = F_2(t)$, $y'(t') = y(t)$, will make
it so. (If $F_2$ is bounded, the integrals will then become integrals between
finite limits, but this causes no change in the argument. An obvious
convention is to be used if the transformation from $t$ to $t'$ is not one to
one.) We prove first that for a given $k$, if $\Phi(\cdot)$ is an $\omega$ function measurable
with respect to the field $\mathcal{F}_{\tau_k}$, with $E\{|\Phi(\omega)|^2\} < \infty$, then the $t$, $\omega$ function

$$\Phi(t, \omega) = \begin{cases} \Phi(\omega), & t = \tau_k, \\ 0, & t \neq \tau_k, \end{cases}$$



is in the class $M$. We prove this by exhibiting a sequence of $(t, \omega)$ step
functions which converges to $\Phi(\cdot, \cdot)$ in the sense of $\Phi$ distance. In fact,
$\Phi_n(\cdot, \cdot)$ defined by

$$\Phi_n(t, \omega) = \begin{cases} \Phi(\omega), & \tau_k \le t < \tau_k + 1/n, \\ 0, & \text{otherwise}, \end{cases}$$

is a $(t, \omega)$ step function and

$$\|\Phi(\cdot, \cdot) - \Phi_n(\cdot, \cdot)\|^2 = E\{|\Phi(\omega)|^2\}\, [F(\tau_k + 1/n -) - F(\tau_k +)] \to 0 \qquad (n \to \infty).$$

The class $M$ is a linear manifold. Hence finite sums of functions $\Phi(\cdot, \cdot)$
of the above type are also in $M$. Since the class $M$ is closed in the sense
of $\Phi$ distance, a function $\Phi(\cdot, \cdot)$ defined by

$$\Phi(t, \omega) = \begin{cases} \Phi_k(\omega), & t = \tau_k, \quad k = 1, 2, \cdots, \\ 0, & \text{otherwise}, \end{cases}$$

is also in $M$, if each $\Phi_k$ is measurable with respect to the field $\mathcal{F}_{\tau_k}$, and if

$$\sum_k E\{|\Phi_k(\omega)|^2\}\, [F(\tau_k +) - F(\tau_k -)] < \infty.$$

In fact, the function $\Phi(\cdot, \cdot)$ is then the limit in the sense of $\Phi$ distance of
finite sums of the functions of the first type considered. The functions
which we have now proved are in $M$ are precisely the functions
$\Phi(\cdot, \cdot)$ which satisfy I₂ and which vanish except when $t \in \{\tau_k\}$. We next
suppose that $F$ is continuous, so that $F(t) = F_2(t) = t$, and prove that,
if $\Phi(\cdot, \cdot)$ satisfies I₂, then $\Phi(\cdot, \cdot) \in M$. It is clearly sufficient to prove this
for functions $\Phi(\cdot, \cdot)$ which are bounded and which vanish for $t$ outside
some finite interval. Suppose then that $\Phi(\cdot, \cdot)$ has these properties.
Define $\alpha_n(t)$ by


JOM Sie po fond 


j=>, < s 
nl ) Qn 2” 2n J 


0, 


esu 


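Then $\Phi[\alpha_n(t - s) + s, \omega]$ defines a $(t, \omega)$ step function, and it is sufficient
to prove that $s$ can be chosen in such a way that

$$\lim\, \|\Phi[\alpha_n(t - s) + s, \omega] - \Phi(t, \omega)\| = 0.$$

To do this we prove first that, if $f$ is a bounded Lebesgue-measurable
function of $s$ which vanishes outside some finite interval, then

(5.5)   $\lim_{h \to 0} \int_{-\infty}^{\infty} |f(s + h) - f(s)|^2\, ds = 0.$

In fact, for every $\varepsilon > 0$ there is a continuous function $f_\varepsilon$, vanishing outside
some finite interval, with

$$\int_{-\infty}^{\infty} |f(s) - f_\varepsilon(s)|^2\, ds < \varepsilon^2,$$

so that, using Minkowski's inequality,

$$\limsup_{h \to 0} \Bigl[\int_{-\infty}^{\infty} |f(s + h) - f(s)|^2\, ds\Bigr]^{1/2} \le \limsup_{h \to 0} \Bigl[\int_{-\infty}^{\infty} |f_\varepsilon(s + h) - f_\varepsilon(s)|^2\, ds\Bigr]^{1/2} + 2\varepsilon = 2\varepsilon.$$

According to (5.5),

$$\lim_{h \to 0} \int_{-\infty}^{\infty} |\Phi(s + h, \omega) - \Phi(s, \omega)|^2\, ds = 0$$

for almost all $\omega$. Then, for each $t$,

$$\lim_{n \to \infty} \int_{-\infty}^{\infty} |\Phi[\alpha_n(t) + s, \omega] - \Phi(t + s, \omega)|^2\, ds = 0$$

for almost all $\omega$. Hence

$$\lim_{n \to \infty} \int_\Omega \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} |\Phi[\alpha_n(t) + s, \omega] - \Phi(t + s, \omega)|^2\, ds\, dt\, dP = 0,$$

that is, the integrand converges to 0 in $(s, t, \omega)$ measure when $n \to \infty$.
There is then a sequence of integers $\{n_j\}$ and a value of $s$ such that

$$\lim_{j \to \infty} \int_\Omega \int_{-\infty}^{\infty} |\Phi[\alpha_{n_j}(t) + s, \omega] - \Phi(t + s, \omega)|^2\, dt\, dP = \lim_{j \to \infty} \int_\Omega \int_{-\infty}^{\infty} |\Phi[\alpha_{n_j}(t - s) + s, \omega] - \Phi(t, \omega)|^2\, dt\, dP = 0$$

for almost all $\omega$, as was to be proved.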

We have shown that the class $M$ of functions $\Phi$ contains all the functions
satisfying I₂ if $F$ is continuous, and we showed earlier that in any case
$M$ contains all functions satisfying I₂ which vanish except at the
discontinuities of $F$. These two results are combined as follows. Let
$\Phi(\cdot, \cdot)$ satisfy I₂, and define

$$\Phi_1(t, \omega) = \begin{cases} \Phi(t, \omega), & t \in \{\tau_k\}, \\ 0, & t \notin \{\tau_k\}, \end{cases} \qquad \Phi_2(t, \omega) = \Phi(t, \omega) - \Phi_1(t, \omega).$$


442 PROCESSES WITH ORTHOGONAL INCREMENTS 1X 


Then ®,(-, -) and ®,(-, -) both satisfy I,, and ®,(-, :) eM since it vanishes 
except when t e {z,}. There remains the proof that ®,(-, -), which satisfies 
I, and vanishes if ¢¢<{7,}, is also in M. We have already shown that 
there is a sequence {’,(-, -)} of functions in M [in fact of (t, œ) step 
functions] such that 


lim J Ei[®(t, o) —,(¢, @) |} dt = 0. 


If we modify each ¥,(f, œ) by setting it equal to 0 when re {rx} the 
modified function still is in W and we now have 


lim |J,—¥,|2 = tim f E(D, o) —¥,(6, @))} dF@) = 0. 


Then ®,(:, -) € M, as was to be proved. 

The definition of the integral (5.1) is now complete. We defined it
first for $\Phi(\cdot, \cdot)$ a $(t, \omega)$ step function, and then for all functions which
are limits of $(t, \omega)$ step functions in the sense of $\Phi$ distance. We then
proved that the class of integrands obtained in this way includes all the
functions satisfying I$_2$. Actually it is a slightly larger class in general.
The stochastic integral over any Borel $t$-set, or $t$-set measurable with
respect to $dF$ measure, is defined by setting the integrand equal to 0
outside this set. We observe that because of the method of definition
the stochastic integral is only uniquely defined neglecting $\omega$ sets of
probability 0; that is, to any integrand corresponds a whole family of
integrals, any two of which are equal for almost all $\omega$.

If $\Phi(\cdot, \cdot)$ is actually a function of $t$ alone, this integral reduces to the
one defined in §2, but in this case, as we have seen in §2, the method of
proof only requires that the $y(t)$ process have orthogonal increments.

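In particular, if the integrand vanishes except at the set $\{\tau_k\}$ of
discontinuities of $F$, the stochastic integral is easily seen to be

$$\int_{-\infty}^{\infty} \Phi(t, \omega) \, dy(t) = \sum_k \Phi(\tau_k, \omega) [y(\tau_k +) - y(\tau_k -)],$$

where the series (if infinite) converges in the mean. In general, for
purposes of evaluation we note that if $a = t_0 < \cdots < t_m = b$, and if
$\alpha(t)$ is defined by

$$\alpha(t) = t_j, \qquad t_j \le t < t_{j+1}, \qquad j < m,$$

then, if $F$ is continuous at the $t_j$'s,

$$E\Big\{ \Big| \int_a^b [\Phi(t, \omega) - \Phi[\alpha(t), \omega]] \, dy(t) \Big|^2 \Big\}
= \int_a^b E\{|\Phi(t, \omega) - \Phi[\alpha(t), \omega]|^2\} \, dF(t)
= \sum_{j=0}^{m-1} \int_{t_j}^{t_{j+1}} E\{|\Phi(t, \omega) - \Phi(t_j, \omega)|^2\} \, dF(t). \tag{5.6}$$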


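Example Let $\Phi(t, \omega) = y(t, \omega) - y(a, \omega)$. Then we shall evaluate

$$\int_a^b [y(t) - y(a)] \, dy(t)$$

for two different $y(t)$ processes. In each case $dF(t) = dt$, so that the last
term in (5.6) becomes

$$\sum_{j=0}^{m-1} \int_{t_j}^{t_{j+1}} (t - t_j) \, dt \le (b - a) \operatorname*{Max}_j \, (t_{j+1} - t_j).$$

Then, when $\operatorname{Max} (t_{j+1} - t_j)$ is small, the stochastic integral of $\Phi[\alpha(t), \omega]$
will be nearly that of $\Phi(t, \omega)$. The former integral is easily calculated,
since $\Phi[\alpha(t), \omega]$ defines a $(t, \omega)$ step function,

$$\int_a^b \Phi[\alpha(t), \omega] \, dy(t) = \sum_{j=0}^{m-1} [y(t_j) - y(a)][y(t_{j+1}) - y(t_j)]
= \tfrac{1}{2} [y(b) - y(a)]^2 - \tfrac{1}{2} \sum_{j=0}^{m-1} [y(t_{j+1}) - y(t_j)]^2.$$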
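The second equality in the preceding display is a telescoping identity. As a worked step, write $u_j = y(t_j) - y(a)$ (a shorthand introduced only for this remark, not notation from the text), so that $u_0 = 0$ and $u_m = y(b) - y(a)$. Then

$$u_j (u_{j+1} - u_j) = \tfrac{1}{2} (u_{j+1}^2 - u_j^2) - \tfrac{1}{2} (u_{j+1} - u_j)^2,$$

$$\sum_{j=0}^{m-1} u_j (u_{j+1} - u_j) = \tfrac{1}{2} (u_m^2 - u_0^2) - \tfrac{1}{2} \sum_{j=0}^{m-1} (u_{j+1} - u_j)^2 = \tfrac{1}{2} [y(b) - y(a)]^2 - \tfrac{1}{2} \sum_{j=0}^{m-1} [y(t_{j+1}) - y(t_j)]^2.$$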
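It follows that, if $\delta = \operatorname{Max} (t_{j+1} - t_j)$,

$$\int_a^b [y(t) - y(a)] \, dy(t) = \tfrac{1}{2} [y(b) - y(a)]^2 - \tfrac{1}{2} \operatorname*{l.i.m.}_{\delta \to 0} \sum_{j=0}^{m-1} [y(t_{j+1}) - y(t_j)]^2.$$

We observe that a formal integration would give only the first term on
the right. Suppose now that $y(t) = z(t) - t$, where the $z(t)$ process is a
Poisson process (see VIII §4) with rate parameter 1, so that

$$E\{dy(t)\} = 0, \qquad E\{[dy(t)]^2\} = dt. \tag{5.7}$$

Then, with probability 1, $z(t_0) \le \cdots \le z(t_m)$ and, with probability which
approaches 1 when $\delta \to 0$, two successive values in this finite sequence
are either equal or differ by 1. Each increment of $y$ then differs from the
corresponding increment of $z$ only by the length of the subinterval, which is
at most $\delta$. It follows that for almost all $\omega$

$$\lim_{\delta \to 0} \sum_{j=0}^{m-1} [y(t_{j+1}) - y(t_j)]^2 = z(b) - z(a) = y(b) - y(a) + (b - a), \tag{5.8}$$

if the $t_j$'s are restricted to be in some enumerable set. Then

$$\int_a^b [y(t) - y(a)] \, dy(t) = \tfrac{1}{2} [y(b) - y(a)]^2 - \tfrac{1}{2} [z(b) - z(a)].$$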
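As a numerical illustration of the Poisson case, the following sketch evaluates the left-endpoint sums on a fine uniform partition for one hand-picked sample path. The jump times, the grid size, and the function name `compensated_poisson_sums` are assumptions made for the illustration; none of them comes from the text. For $y(t) = z(t) - t$ each jump of $z$ contributes about 1 to the sum of squared increments, so that sum should be close to the number of jumps $z(b) - z(a)$.

```python
from bisect import bisect_right


def compensated_poisson_sums(jump_times, a, b, m):
    """Left-endpoint sum and sum of squared increments for y(t) = z(t) - t.

    jump_times: sorted jump instants of one hypothetical sample path of z.
    The partition is the uniform grid a = t_0 < ... < t_m = b.
    Returns (left_sum, squares, y(b) - y(a)).
    """
    grid = [a + (b - a) * j / m for j in range(m + 1)]
    base = bisect_right(jump_times, a)
    # y[j] holds y(t_j) - y(a): jumps counted in (a, t_j], minus elapsed time.
    y = [bisect_right(jump_times, t) - base - (t - a) for t in grid]
    left_sum = sum((y[j] - y[0]) * (y[j + 1] - y[j]) for j in range(m))
    squares = sum((y[j + 1] - y[j]) ** 2 for j in range(m))
    return left_sum, squares, y[-1] - y[0]


jumps = [0.3, 1.7, 2.2]  # illustrative jump times, not data from the text
left_sum, squares, dy = compensated_poisson_sums(jumps, 0.0, 3.0, 1000)
n_jumps = len(jumps)

print("sum of squared increments:", squares, "number of jumps:", n_jumps)
print("left-endpoint sum:", left_sum)
print("(1/2) dy^2 - (1/2) n_jumps:", 0.5 * dy ** 2 - 0.5 * n_jumps)
```

Refining the grid (larger `m`) should bring the left-endpoint sum closer to the value on the last printed line, since the discrepancy comes only from the subinterval lengths.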


As a second case suppose that the $y(t)$ process is a Brownian motion
process with variance parameter 1 (see VIII §2). Then (5.7) is true once
more, but (5.8) must be replaced, according to VIII, Theorem 2.3, by

$$\operatorname*{l.i.m.}_{\delta \to 0} \sum_{j=0}^{m-1} [y(t_{j+1}) - y(t_j)]^2 = b - a,$$

so that in this case the stochastic integral in question has the evaluation

$$\int_a^b [y(t) - y(a)] \, dy(t) = \tfrac{1}{2} [y(b) - y(a)]^2 - \tfrac{1}{2} (b - a).$$
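The Brownian evaluation can be checked by simulation. The sketch below builds one approximate Brownian path from independent Gaussian increments on a uniform grid and compares the left-endpoint sum with $\tfrac{1}{2}[y(b) - y(a)]^2 - \tfrac{1}{2}(b - a)$. The grid size, the seed, and the function name `brownian_left_sum` are assumptions chosen for the illustration; the agreement is only approximate because the limit in the mean is replaced by a single finite partition.

```python
import math
import random


def brownian_left_sum(a, b, m, seed=0):
    """Simulate y on a uniform grid of m steps over [a, b], with y(a) = 0.

    Returns (left_sum, squares, y(b) - y(a)), where left_sum is the
    left-endpoint sum of [y(t_j) - y(a)] [y(t_{j+1}) - y(t_j)] and squares
    is the sum of squared increments.
    """
    rng = random.Random(seed)
    h = (b - a) / m
    y = [0.0]
    for _ in range(m):
        # Independent increments with mean 0 and variance h.
        y.append(y[-1] + rng.gauss(0.0, math.sqrt(h)))
    left_sum = sum(y[j] * (y[j + 1] - y[j]) for j in range(m))
    squares = sum((y[j + 1] - y[j]) ** 2 for j in range(m))
    return left_sum, squares, y[-1]


a, b = 0.0, 2.0
left_sum, squares, w = brownian_left_sum(a, b, 20000)

print("sum of squared increments:", squares, "b - a:", b - a)
print("left-endpoint sum:", left_sum)
print("(1/2) w^2 - (1/2)(b - a):", 0.5 * w ** 2 - 0.5 * (b - a))
```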
We now consider the $x(t)$ process defined by

$$x(t) = \int_a^t \Phi(s, \omega) \, dy(s), \qquad t \ge a. \tag{5.9}$$

This process is not uniquely determined, in the sense that, for each $t$,
$x(t)$ is only determined up to values on a set of measure 0, and we are
therefore free to modify any given choice of $x(t)$, for each $t$, on an $\omega$ set
of measure 0. According to II, Theorem 2.4, this leeway makes it
possible to define $x(t)$ in such a way that the $x(t)$ process is separable.

THEOREM 5.1 Every x(t) process defined by (5.9) is a martingale. 

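Let $b > a$ be some fixed number. It was shown that there is a sequence
$\{\Phi_n(\cdot, \cdot)\}$ of $(t, \omega)$ step functions such that

$$\lim_{n \to \infty} \int_a^b E\{|\Phi(t, \omega) - \Phi_n(t, \omega)|^2\} \, dF(t) = 0.$$

The stochastic integral $x_n(t)$ defined by

$$x_n(t) = \int_a^t \Phi_n(s, \omega) \, dy(s)$$

is a certain finite sum, and, for fixed $t \in [a, b]$,

$$x(t) = \operatorname*{l.i.m.}_{n \to \infty} x_n(t)$$

since

$$E\{|x(t) - x_n(t)|^2\} = \int_a^t E\{|\Phi(s, \omega) - \Phi_n(s, \omega)|^2\} \, dF(s)
\le \int_a^b E\{|\Phi(s, \omega) - \Phi_n(s, \omega)|^2\} \, dF(s) \to 0.$$

Now, if $t_1 < t_2$,

$$E\{x_n(t_2) \mid \mathcal{F}_{t_1}\} = x_n(t_1)$$

with probability 1, by inspection of the sums which define $x_n(t)$, that is,

$$\int_{\Lambda} x_n(t_2) \, dP = \int_{\Lambda} x_n(t_1) \, dP, \qquad \Lambda \in \mathcal{F}_{t_1}.$$

When $n \to \infty$ this equation becomes

$$\int_{\Lambda} x(t_2) \, dP = \int_{\Lambda} x(t_1) \, dP,$$

so that

$$E\{x(t_2) \mid \mathcal{F}_{t_1}\} = x(t_1)$$

with probability 1. This equation implies that the process $\{x(t), \mathcal{F}_t, t \in T\}$
is a martingale.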

THEOREM 5.2 The stochastic integral (5.9) can be defined for each t in 
such a way that the x(t) process is a separable martingale. Almost all 
sample functions of this process will then have one-sided limits at all points. 
The fixed points of discontinuity of the x(t) process are points of discontinuity 
of F. If the sample functions of the y(t) process are almost all continuous, 
those of a separable x(t) process will almost all be continuous also. 

Only the last two statements of the theorem remain to be proved. 


Since

$$E\Big\{ \Big| \int_{t_1}^{t_2} \Phi(t, \omega) \, dy(t) \Big|^2 \Big\} = \int_{t_1}^{t_2} E\{|\Phi(t, \omega)|^2\} \, dF(t),$$

a fixed point of discontinuity of the $x(t)$ process must be a discontinuity
of $F$. If almost all sample functions of the $y(t)$ process are continuous,
$F$ must be continuous, since

$$E\{|y(t_2) - y(t_1)|^2\} = F(t_2) - F(t_1).$$




Then, according to VII, Theorem 11.9, the $y(t)$ process is the Brownian
motion process after a change of the time variable. We prove that almost
all the (separable) $x(t)$ process sample functions are continuous in this
case by first remarking that the statement is obvious if $\Phi$ is a $(t, \omega)$ step
function, and then proving the general statement by approximation, as
follows. It is sufficient to prove it for $t$ in a finite interval $[a, b]$. Let $\Phi_n$
be a $(t, \omega)$ step function with

$$\int_a^b E\{|\Phi(t, \omega) - \Phi_n(t, \omega)|^2\} \, dF(t) < \frac{1}{n^4},$$

and define an $x_n(t)$ process by

$$x_n(t, \omega) = \int_a^t \Phi_n(s, \omega) \, dy(s).$$

Then

$$x(t, \omega) - x_n(t, \omega) = \int_a^t [\Phi(s, \omega) - \Phi_n(s, \omega)] \, dy(s),$$

so that the $x(t) - x_n(t)$ process is a martingale, which we can suppose is
made separable by a proper choice of $x_n(t)$ for each $t$. Hence, by the
continuous parameter version of III, Theorem 2.1, or of VII, Theorem 3.2,
applied to the $|x(t) - x_n(t)|^2$ process,
$$P\Big\{ \operatorname*{L.U.B.}_{a \le t \le b} |x(t) - x_n(t)| \ge \frac{1}{n} \Big\} \le n^2 E\{|x(b) - x_n(b)|^2\} \le \frac{1}{n^2}.$$

It now follows from the Borel-Cantelli lemma (III, Theorem 1.2), since
the right side of the preceding inequality is the general term of a
convergent series, that

$$|x(t, \omega) - x_n(t, \omega)| < 1/n, \qquad a \le t \le b,$$

for sufficiently large $n$, with probability 1. This implies that, with
probability 1, the $x_n(t)$ process sample functions converge uniformly to
those of the $x(t)$ process, which proves the desired result.

Certain applications in VI make it desirable to generalize the stochastic
integral (5.1), in the case where $F$ is absolutely continuous, by relaxing
the condition on the $y(t)$ process. Condition I$_1$ is replaced by:

I$_1'$ The process $\{y(t), \mathcal{F}_t, -\infty < t < \infty\}$ is a martingale, with

$$E\{|y(t) - y(s)|^2\} < \infty$$

for all $s$ and $t$; $f$ is a non-negative $(t, \omega)$ function defined and measurable
with respect to $dt \, dP$ measure, such that, for each $s$, $f(s, \cdot)$ is $\omega$ measurable
with respect to the field $\mathcal{F}_s$ and that, if $s < t$,

$$E\Big\{ \int_s^t f(u, \omega) \, du \Big\} < \infty$$

and

$$E\{|y(t) - y(s)|^2 \mid \mathcal{F}_s\} = E\Big\{ \int_s^t f(u, \omega) \, du \,\Big|\, \mathcal{F}_s \Big\}$$

with probability 1. Condition I$_2$ is replaced by:

I$_2'$ $\Phi$ is a $(t, \omega)$ function measurable with respect to $dt \, dP$ measure.
For each $s$, $\Phi(s, \cdot)$ is $\omega$ measurable with respect to the field $\mathcal{F}_s$. Finally

$$\int_{-\infty}^{\infty} E\{|\Phi(t, \omega)|^2 f(t, \omega)\} \, dt < \infty.$$

In particular, if $f$ is in fact a function of $t$ alone, the hypotheses become a
special case of I$_1$ and I$_2$ with $F'(t) = f(t)$.

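We now outline the definition of the stochastic integral (5.1) in the
present case. Few changes are needed. However, (5.2) is replaced by

$$E\{\varphi\} = 0, \qquad E\{\varphi \bar{\psi}\} = \int_{-\infty}^{\infty} E\{\Phi(t, \omega) \overline{\Psi(t, \omega)} f(t, \omega)\} \, dt, \tag{5.2$'$}$$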


and, in the discussion of the definition, the distance between functions $\Phi$
and $\Psi$ is defined by

$$\lVert \Phi - \Psi \rVert = \Big[ \int_{-\infty}^{\infty} E\{|\Phi(t, \omega) - \Psi(t, \omega)|^2 f(t, \omega)\} \, dt \Big]^{1/2}.$$

The new $(t, \omega)$ step functions are defined like the old ones, except that, in
the notation of the old definition, it is supposed not that $\Phi$ is integrable
but that

$$\int_{a_j}^{a_{j+1}} E\{|\Phi_j|^2 f(t, \omega)\} \, dt < \infty.$$

To show that any $\Phi$ satisfying I$_2'$ can be approximated arbitrarily closely
in terms of the new distance, by new $(t, \omega)$ step functions, we can assume
that $\Phi$ is bounded, $|\Phi| \le K$, and vanishes outside a finite interval $[a, b]$.
Then $\Phi$ satisfies I$_2$ and hence there is a sequence of (old) $(t, \omega)$ step
functions $\Phi_1, \Phi_2, \cdots$ such that

$$\lim_{n \to \infty} \int_a^b E\{|\Phi(t, \omega) - \Phi_n(t, \omega)|^2\} \, dt = 0.$$



Moreover our earlier discussion shows that we can assume that $|\Phi_n| \le K$.
Then $\Phi_n$ converges to $\Phi$ in $dt \, dP$ measure, so that $|\Phi - \Phi_n|^2 f$ converges
to 0 in $dt \, dP$ measure, and the latter function is dominated by the $dt \, dP$
integrable function $4K^2 f$. Hence

$$\lim_{n \to \infty} \int_a^b E\{|\Phi(t, \omega) - \Phi_n(t, \omega)|^2 f(t, \omega)\} \, dt = 0,$$

and this is the desired approximation result. Theorems 5.1 and 5.2
remain true, and their proofs require no change of method. There are
now no fixed points of discontinuity in Theorem 5.2.

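Finally we remark that, if $x(t)$ is defined by

$$x(t) = \int_a^t \Phi(s, \omega) \, dy(s),$$

the $x(t)$ process is a martingale, according to Theorem 5.1, and, if $s < t$,

$$E\{x(t) \mid \mathcal{F}_s\} = x(s),$$

$$E\{|x(t) - x(s)|^2 \mid \mathcal{F}_s\} = E\Big\{ \int_s^t |\Phi(u, \omega)|^2 f(u, \omega) \, du \,\Big|\, \mathcal{F}_s \Big\}
= E\Big\{ \int_s^t f_1(u, \omega) \, du \,\Big|\, \mathcal{F}_s \Big\}$$

with probability 1, where

$$f_1(u, \omega) = |\Phi(u, \omega)|^2 f(u, \omega).$$

Thus the $x(t)$ process and function $f_1$ satisfy the same hypotheses as the
$y(t)$ process and function $f$. This means that we can consider stochastic
integrals of the form

$$\int_a^b \Phi_1(t, \omega) \, dx(t).$$

Moreover,

$$\int_a^b \Phi_1(t, \omega) \, dx(t) = \int_a^b \Phi_1(t, \omega) \Phi(t, \omega) \, dy(t),$$

because this equality is true if $\Phi_1$ is a $(t, \omega)$ step function. In particular,
if $\Phi$ does not vanish, or at least vanishes at most on a $(t, \omega)$ set of $ds \, dP$
measure 0,

$$y(t) - y(a) = \int_a^t \frac{dx(s)}{\Phi(s, \omega)}.$$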



Now consider the following problem: What $x(t)$ process can be
represented in the form (5.9), with the $y(t)$ process a Brownian motion process?
As we have seen, it is no restriction to assume that the $x(t)$ process defined
in this way is separable. If we assume that the Brownian motion process
has variance parameter 1, we find that, if $t_1 < t_2$,

$$E\{|x(t_2) - x(t_1)|^2 \mid \mathcal{F}_{t_1}\} = E\Big\{ \int_{t_1}^{t_2} |\Phi(s, \omega)|^2 \, ds \,\Big|\, \mathcal{F}_{t_1} \Big\} \tag{5.10}$$

with probability 1. Moreover, according to Theorem 5.2 the $x(t)$ process
sample functions are almost all continuous. The following theorem
shows that these two conditions are sufficient.

THEOREM 5.3 Let $\{x(t), \mathcal{F}_t, a \le t \le b\}$ be a martingale with the
following properties.

(i) $E\{|x(t)|^2\} < \infty$, $a \le t \le b$.

(ii) Almost all sample functions of the process are continuous in $(a, b)$.

(iii) There is a non-negative $(t, \omega)$ function $\Phi$, measurable with respect
to $dt \, dP$ measure, such that, for each $s$, $\Phi(s, \cdot)$ is $\omega$ measurable with respect
to the field $\mathcal{F}_s$, and such that, if $t_1 < t_2$, (5.10) is true with probability 1.

Then, if $\Phi$ vanishes almost nowhere on $(t, \omega)$ space, there is a Brownian
motion process $\{y(t), a \le t \le b\}$ such that for each $t \in [a, b]$,

$$x(t) = x(a) + \int_a^t \Phi(s, \omega) \, dy(s) \tag{5.11}$$

with probability 1. Even without this additional hypothesis on the vanishing
of $\Phi$, the statement is true after a Brownian motion process has been
adjoined to the $x(t)$ process.

See II §2 for the meaning of the adjunction of a Brownian motion
process to the given process. As a trivial example of a case in which this
adjunction is necessary, consider an $\omega$ space consisting of a single point,
and define $x(t, \omega) = 1$ for all $t$. The hypotheses of the theorem are
certainly satisfied in this case, and $\Phi \equiv 0$. Obviously there is no
Brownian motion process $\{y(t), a \le t \le b\}$ because $\omega$ space is not
sufficiently complex to carry such a process.

According to the hypotheses of this theorem we can define a $y(t)$
process by

$$y(t) = \int_a^t \frac{dx(s)}{\Phi(s, \omega)}$$




(defining $1/\Phi(s, \omega)$ to be 0 if $\Phi(s, \omega) = 0$). The $y(t)$ process is a
martingale and, if $t_1 < t_2$,

$$E\{|y(t_2) - y(t_1)|^2 \mid \mathcal{F}_{t_1}\} = E\Big\{ \int_{t_1}^{t_2} \frac{|\Phi(s, \omega)|^2}{|\Phi(s, \omega)|^2} \, ds \,\Big|\, \mathcal{F}_{t_1} \Big\}$$

with probability 1, where the quotient is taken to be 0 when $\Phi(s, \omega) = 0$.
In particular, if $\Phi$ never vanishes, or even if it
vanishes almost nowhere in $(t, \omega)$, it follows that the right side of this
equation becomes $t_2 - t_1$. Moreover, according to Theorem 5.2, the $y(t)$
process can be defined to have almost all its sample functions continuous.
Hence by VII, Theorem 11.9, the $y(t)$ process is a Brownian motion
process, and the inversion of the defining formula gives (5.11). If $\Phi$ may
vanish, the right side of the preceding equation is no longer $t_2 - t_1$ and
the preceding argument is no longer valid. Suppose, however, that
there is a Brownian motion process $\{z(t), a \le t \le b\}$ for which, if
$t_1 < t_2$ and $t_1 \le t_1' < t_2'$,

$$E\{z(t_2) \mid \mathcal{F}_{t_1}\} = z(t_1),$$

$$E\{|z(t_2) - z(t_1)|^2 \mid \mathcal{F}_{t_1}\} = t_2 - t_1,$$

$$E\{[x(t_2) - x(t_1)][z(t_2') - z(t_1')] \mid \mathcal{F}_{t_1}\} = 0$$

with probability 1. Such a process can always be obtained by the
adjunction procedure. Define


$$\Phi_0(t, \omega) = \begin{cases} 1, & \text{if } \Phi(t, \omega) = 0, \\ 0, & \text{if } \Phi(t, \omega) \ne 0, \end{cases}$$

$$y(t) = \int_a^t \frac{dx(s)}{\Phi(s, \omega)} + \int_a^t \Phi_0(s, \omega) \, dz(s),$$

choosing the integrals to get a separable $y(t)$ process. Then the $y(t)$
process has continuous sample functions, by Theorem 5.2. Moreover,
since $\Phi_0 \Phi = 0$,

$$E\Big\{ \Big| \int_a^t \Phi_0(s, \omega) \, dx(s) \Big|^2 \Big\} = \int_a^t E\{|\Phi_0 \Phi|^2\} \, ds = 0, \tag{5.12}$$

using the rules of combination we have developed for stochastic integrals.
The following equations (true with probability 1) are now easy to verify:

$$E\{|y(t_2) - y(t_1)|^2 \mid \mathcal{F}_{t_1}\} = E\Big\{ \int_{t_1}^{t_2} \Big| \frac{\Phi(s, \omega)}{\Phi(s, \omega)} \Big|^2 ds + \int_{t_1}^{t_2} |\Phi_0(s, \omega)|^2 \, ds \,\Big|\, \mathcal{F}_{t_1} \Big\} = t_2 - t_1, \tag{5.13}$$

$$\int_a^t \Phi(s, \omega) \, dy(s) = \int_a^t \frac{\Phi(s, \omega)}{\Phi(s, \omega)} \, dx(s) = \int_a^t (1 - \Phi_0) \, dx(s). \tag{5.14}$$

Now, from (5.14),

$$x(t) = x(a) + \int_a^t \Phi_0(s, \omega) \, dx(s) + \int_a^t \Phi(s, \omega) \, dy(s) \tag{5.15}$$

with probability 1 and, from (5.12), the first integral on the right may be
omitted. Finally the $y(t)$ process is a Brownian motion process because
(see VII, Theorem 11.9) its sample functions are almost all continuous,
and (5.13) is true. Then (5.15) reduces to (5.11).


CHAPTER X 


Stationary Processes— 
Discrete Parameter 


1. Generalities; metric transitivity 


(a) Strictly stationary processes Strictly stationary processes were
defined in II §8. The study of these processes is usually carried out by
non-probabilists in the language of measure-preserving transformations,
but the connection between the two ideas is not always clearly stated in
the literature, and we proceed to develop it in some detail.

We adopt our usual basic hypothesis: there is a probability measure
defined on a Borel field of sets of some space $\Omega$. A transformation $T$
taking points of $\Omega$ into points of $\Omega$ is called a 1-1 measure-preserving
point transformation if it is 1-1, has domain and range $\Omega$, and if it and its
inverse take measurable sets into measurable sets of the same probability.
Such a transformation induces a 1-1 transformation of random variables
into random variables (which we shall also denote by $T$), if we define

$$[Tx](\omega) = x(T^{-1} \omega).$$

Under this transformation, if $x$ is a random variable which is 1 on a
measurable $\omega$ set $\Lambda$ and 0 otherwise, $Tx$ is 1 on the image of $\Lambda$ under $T$,
and 0 otherwise. For any $x$, $Tx$ has the same distribution as $x$, and in
fact the stochastic process

$$\{x_n, -\infty < n < \infty\}, \qquad x_n = T^n x,$$

is strictly stationary. Thus any measure-preserving point transformation
can be used to generate strictly stationary stochastic processes.
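A small finite example may make the construction concrete. The sketch below assumes a sample space of six points with the uniform probability measure and takes the cyclic shift as the 1-1 measure-preserving point transformation; the sample space, the random variable `x`, and all function names are assumptions chosen for the illustration. It applies the rule $[Tx](\omega) = x(T^{-1}\omega)$ and compares the joint distributions of consecutive blocks $(x_n, \cdots, x_{n+k})$ for different $n$.

```python
from collections import Counter
from fractions import Fraction

N = 6                      # hypothetical sample space {0, ..., 5}, uniform measure
x = [2, 0, 1, 1, 0, 2]     # an arbitrary random variable, listed by sample point


def T_inverse(w):
    """Inverse of the cyclic shift T(w) = (w + 1) mod N."""
    return (w - 1) % N


def apply_T(rv):
    """Induced transformation of random variables: [Tx](w) = x(T^{-1} w)."""
    return [rv[T_inverse(w)] for w in range(N)]


def power(rv, n):
    """Return T^n applied to rv, for n >= 0."""
    for _ in range(n):
        rv = apply_T(rv)
    return rv


def joint_law(n, k):
    """Exact distribution of (x_n, ..., x_{n+k}) under the uniform measure."""
    cols = [power(x, n + i) for i in range(k + 1)]
    counts = Counter(tuple(c[w] for c in cols) for w in range(N))
    return {value: Fraction(count, N) for value, count in counts.items()}


print(joint_law(0, 2) == joint_law(3, 2))   # same law after shifting the block
print(apply_T([1, 1, 0, 0, 0, 0]))          # indicator of {0, 1} becomes that of {1, 2}
```

The second call shows the statement about indicators: the indicator of a set is carried into the indicator of the image of that set under $T$.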

A set transformation $T$ defined on the Borel field of measurable
$\omega$ sets, taking measurable $\omega$ sets into measurable $\omega$ sets, is called a
measure-preserving set transformation if the following conditions are
satisfied:

MP$_1$ $T$ is single valued, modulo sets of probability 0: if $\Lambda_1$ is an image
of $\Lambda$ under $T$, the class of all images of $\Lambda$ is the class of all measurable
sets differing from $\Lambda_1$ by sets of probability 0. In the following the
notation $T\Lambda$ will always mean a particular image of $\Lambda$ under $T$.

MP$_2$ $P\{T\Lambda\} = P\{\Lambda\}$.

MP$_3$ Neglecting $\omega$ sets of probability 0,

$$T(\Lambda_1 \cup \Lambda_2) = T\Lambda_1 \cup T\Lambda_2, \qquad
T\Big( \bigcup_{n=1}^{\infty} \Lambda_n \Big) = \bigcup_{n=1}^{\infty} T\Lambda_n, \qquad
T(\Omega - \Lambda) = \Omega - T\Lambda.$$

Condition MP$_3$ implies that, neglecting sets of probability 0, finite and
enumerably infinite intersections go over into the corresponding
intersections under $T$, and that, if $\Lambda_1 \subset \Lambda_2$, then $T\Lambda_1 \subset T\Lambda_2$. If every
measurable set is the image of some measurable set under $T$, this
transformation must be 1-1 neglecting sets of probability 0 (that is, the
transformation is 1-1 modulo the sets of probability 0) and the inverse $T^{-1}$ is
thus defined, and is also a measure-preserving set transformation. We
shall describe this situation by stating that $T$ has an inverse.

A 1-1 measure-preserving point transformation $T$ induces a
measure-preserving set transformation, if we define the family of images of a set
$\Lambda$ as all the $\omega$ sets differing by at most a set of probability 0 from the
image of $\Lambda$ under the point transformation. This set transformation
evidently has an inverse. However, a set transformation with an inverse
cannot always be generated in this way by a point transformation.

We shall see that measure-preserving set transformations can always 
be avoided in the theory of probability, in favor of the simpler 1-1 
measure-preserving point transformations, but this avoidance obscures 
the significance of the results to some extent. 

If $T$ is a measure-preserving set transformation, there is one and only
one transformation $T_1$, defined for every random variable, taking random
variables into random variables, and having the following properties:

RV$_1$ $T_1$ is single-valued modulo the random variables which vanish with
probability 1: if $x_1$ is an image of $x$ under $T_1$, the class of all images of $x$
is the class of random variables equal to $x_1$ with probability 1. In the
following, the notation $T_1 x$ will always mean a particular image of $x$
under $T_1$.

RV$_2$ $T_1$ is consistent with $T$: if $x$ is a random variable which is 1 on
a measurable $\omega$ set $\Lambda$ and 0 otherwise, then $T_1 x$ is a random variable
which is 1 almost everywhere on $T\Lambda$ and 0 otherwise. Thus $T_1$ can be
considered an extension of $T$.

RV$_3$ $T_1$ is linear: if $a$, $b$ are constants and if $x$, $y$ are random variables,

$$T_1(ax + by) = a T_1 x + b T_1 y$$

with probability 1.

RV$_4$ $T_1$ preserves convergence: if $\lim_{n \to \infty} x_n = x$ with probability 1, then
$\lim_{n \to \infty} T_1 x_n = T_1 x$ with probability 1.

There is at most one transformation $T_1$ with these properties, since two
transformations with these properties would be identical, by RV$_2$ and
RV$_3$, for the random variables taking on only finitely many values, hence
by RV$_4$ for all random variables. We show that there is one such $T_1$
by defining it explicitly, as follows. It is sufficient to consider only real
random variables, since in the complex case the real and imaginary parts
can be treated separately. For each rational $r$ define


$$\Lambda_r = T\{x(\omega) < r\},$$

choosing the set transforms, as we certainly can, so that

$$\Lambda_{r_1} \subset \Lambda_{r_2}, \quad r_1 < r_2; \qquad \bigcap_r \Lambda_r = 0, \qquad \bigcup_r \Lambda_r = \Omega.$$

Now define

$$[T_1 x](\omega) = s \qquad \text{if} \quad \omega \in \bigcap_{r > s} \Lambda_r - \bigcup_{r < s} \Lambda_r.$$

The $\omega$ function $T_1 x$ is defined for all $\omega$, and

$$\{[T_1 x](\omega) < s\} = \bigcup_{r < s} \Lambda_r = T\{x(\omega) < s\},$$

neglecting sets of probability 0, so that $T_1 x$ is a random variable, that is,
a measurable $\omega$ function. If we define the class of images of $x$ under $T_1$
as the class of all random variables equal with probability 1 to the image
just defined, it is easily verified that $T_1$ has the desired properties. In the
following we shall use the same notation $T$ for both the set and random
variable transformation. In particular, if $T$ is a measure-preserving set
transformation derived from a 1-1 measure-preserving point
transformation $S$, it is clear that

$$[Tx](\omega) = x(S^{-1} \omega)$$

with probability 1.

If $T$ is a measure-preserving set transformation, and if $x$ is a random
variable, the stochastic process

$$\{x_n, n \ge 0\}, \qquad x_n = T^n x,$$

is strictly stationary, and, if $T$ has an inverse, the stochastic process

$$\{x_n, -\infty < n < \infty\}, \qquad x_n = T^n x,$$

is strictly stationary. Thus a measure-preserving set transformation can
be used to generate strictly stationary stochastic processes.

We have supposed above that a measure-preserving set transformation 
T has as domain and range space the class of all measurable w sets. But 
it is obvious that nothing whatever would be changed in the discussion 
if domain and range space were any Borel field of measurable sets including 
all sets of probability 0. This trivial extension of the idea will be con- 
venient below. The induced transformation of random variables will 
then have as domain and range space the class of random variables 
measurable with respect to the field in question. 

We now consider a strictly stationary stochastic process $\{x_n, n \ge 0\}$
and study the conditions under which it can be generated by a
measure-preserving set transformation, as described above. In studying this
stochastic process, the only measurable $\omega$ sets of interest are those
determined by conditions on the $x_j$'s, that is, those measurable on the sample
space of the $x_j$'s. We, therefore, wish to determine a measure-preserving
set transformation, with domain the class of sets determined by
conditions on the $x_j$'s, with the property that $T^n x_0 = x_n$, $n \ge 1$, with
probability 1. It will be shown that there is such a transformation and
that it is uniquely determined. The transformation is called the shift,
because $T x_n = x_{n+1}$, $n \ge 0$, with probability 1. If there is such a $T$, it
will take the $\omega$ set

$$\Lambda = \{[x_0(\omega), \cdots, x_n(\omega)] \in A\},$$

where $A$ is a Borel set, into the sets differing by at most sets of probability
0 from the set

$$\{[x_1(\omega), \cdots, x_{n+1}(\omega)] \in A\}.$$

Now if two shift transformations exist they will agree on sets of the above
type, and therefore on all the sets of the Borel field determined by those
of the above type. They will, therefore, agree on all sets determined by
conditions on the $x_j$'s. Thus, if there is a shift, there is only one. There
is actually a shift since we can define $T$ as indicated for the sets of the
above particular type, and then extend the definition to all sets determined
by conditions on the $x_j$'s. (In fact, if the distance between measurable $\omega$
sets $\Lambda_1$, $\Lambda_2$ is defined as

$$P\{\Lambda_1 (\Omega - \Lambda_2)\} + P\{(\Omega - \Lambda_1) \Lambda_2\},$$

the space of measurable sets becomes a complete metric space, and the
desired shift becomes a single-valued distance-preserving function on a
closed subset of the space, the set $E$ corresponding to the sets determined
by conditions on the $x_j$'s. The sets $\Lambda$ defined above form a subset of $E$
dense in $E$, and $T$ as defined on this subset is distance-preserving and is
therefore a uniformly continuous function. There is, therefore, one and
only one way of extending the definition of $T$, preserving continuity, to
all of $E$.) Similarly, if $\{x_n, -\infty < n < \infty\}$ is a strictly stationary
stochastic process, there is a uniquely determined measure-preserving set
transformation, the shift (relative to this process) with domain the Borel
field of sets determined by conditions on the $x_j$'s, such that $T^n x_0 = x_n$
for all integers $n$.

We have already remarked that the introduction of measure-preserving
set (rather than point) transformations can be dispensed with in this
subject. We now proceed to justify this statement. Let $\{x_n, 0 \le n < \infty\}$
be a strictly stationary stochastic process. We shall find a related
stochastic process, which for most purposes can be used to replace the given
one, and is generated by a 1-1 measure-preserving point transformation.
This is done as follows (see I §5 and §6 and II §1). Let $\tilde{\Omega}$ be the coordinate
space of all sequences $\tilde{\omega}$ of numbers $\{\cdots, \xi_0, \xi_1, \cdots\}$; we define a
probability measure in this space by assigning a finite dimensional
distribution to each finite set of coordinate variables. Let $\tilde{x}_n$ be the $n$th
coordinate variable of $\tilde{\Omega}$, so that $\tilde{x}_n(\tilde{\omega}) = \xi_n$ if $\tilde{\omega}$ is the point
$(\cdots, \xi_0, \xi_1, \cdots)$. The $\tilde{\omega}$ set

$$\{\tilde{x}_{n_j}(\tilde{\omega}) \in A_j, \ j = 1, \cdots, n\}$$

is assigned the probability

$$P\{x_{n_j + h}(\omega) \in A_j, \ j = 1, \cdots, n\}.$$

Here the $A_j$'s are Borel sets and $h$ is chosen so large that

$$n_j + h \ge 0, \qquad j \le n.$$

According to Kolmogorov's theorem on infinite dimensional measures
(see I §5), this assignment of finite dimensional distributions determines
a probability measure on a Borel field of $\tilde{\omega}$ sets. The coordinate function
$\tilde{x}_n$ is a random variable on $\tilde{\Omega}$. The strictly stationary stochastic process
$\{\tilde{x}_n, -\infty < n < \infty\}$ has the property that the multivariate distribution
of the random variables $\tilde{x}_0, \cdots, \tilde{x}_n$ on $\tilde{\omega}$ space is the same as that of the
random variables $x_0, \cdots, x_n$ on $\omega$ space. Thus, if only questions
involving these distributions are to be considered, we can use the $\tilde{x}_n$
process instead of the $x_n$ process. For example, consider the averages

$$\frac{x_0 + \cdots + x_n}{n + 1}, \qquad n \ge 0.$$

The law of large numbers makes assertions about the convergence of
these averages when $n \to \infty$. It is clear that these averages converge in
the mean, stochastically, or with probability 1, if and only if the
corresponding $\tilde{x}_n$ averages do so. From the present point of view, the $\tilde{x}_n$
process is more advantageous than the $x_n$ process in that the shift for the
$\tilde{x}_n$ process is generated by a 1-1 measure-preserving point transformation,
the transformation which simply shifts each coordinate one step.

We have supposed here that the $x_n$ process has as parameter range the
set of non-negative integers. If the parameter range is the set of all
integers, the $\tilde{x}_n$ process is defined in the same way, and becomes what we
have called in I §6 a representation of the $x_n$ process.

We have now shown that in discussing strictly stationary stochastic 
processes we are really discussing the iterates of a shift operator acting 
on a given random variable, and that for many purposes we can even 
assume that the shift is a 1-1 measure-preserving point transformation. 
In the following we shall use interchangeably the language of stochastic 
processes and that of measure-preserving set transformations. 

A measurable set is called invariant under a measure-preserving point
or set transformation if it differs from its images by sets of probability 0.
Every set of probability 0 or 1 is then invariant. The invariant sets form
a Borel field. A random variable $x$ is called invariant under a
measure-preserving point or set transformation $T$ if $Tx = x$ with probability 1.
Then every function identically constant with probability 1 is invariant.
If $T$ has an inverse, the same measurable sets and random variables are
invariant with respect to the inverse as with respect to $T$.

If $\Lambda$ is a measurable $\omega$ set, and if $x$ is a random variable which is 1 on
$\Lambda$ and 0 otherwise, then $\Lambda$ is an invariant set if and only if $x$ is an
invariant random variable. If $x$ is an invariant random variable, the $\omega$
set $\{x(\omega) \in A\}$ is an invariant $\omega$ set for every Borel set $A$. Conversely, if
the $\omega$ set $\{x(\omega) \in A\}$ is an invariant set for every Borel set $A$, or even (in
the real case) for every interval $A$, then $x$ is an invariant random variable.

A measure-preserving point or set transformation is called metrically
transitive if the only invariant sets are those which have probability 0 or 1,
that is, if the only invariant random variables are those which are
identically constant with probability 1.
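The definition can be tested exhaustively on a small example. The sketch below assumes a sample space of $N$ points with the uniform measure and the rotation $\omega \to \omega + k \pmod N$ as the measure-preserving point transformation; this model and the function names are assumptions chosen for the illustration. Since every non-empty set has positive probability here, the rotation is metrically transitive exactly when the only invariant sets are the empty set and the whole space.

```python
from itertools import combinations


def invariant_sets(N, k):
    """All subsets of {0, ..., N-1} carried onto themselves by w -> (w + k) mod N."""
    found = []
    for size in range(N + 1):
        for subset in combinations(range(N), size):
            s = frozenset(subset)
            image = frozenset((w + k) % N for w in s)
            if image == s:
                found.append(s)
    return found


def is_metrically_transitive(N, k):
    """True when the invariant sets all have probability 0 or 1 (uniform measure)."""
    return all(len(s) in (0, N) for s in invariant_sets(N, k))


print(is_metrically_transitive(6, 1))   # a single orbit
print(is_metrically_transitive(6, 2))   # two orbits: the even and the odd points
print(sorted(sorted(s) for s in invariant_sets(6, 2)))
```

For the step $k = 2$ the set of even points is invariant and has probability $1/2$, so its indicator is an invariant random variable that is not constant.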

We have seen that to each strictly stationary process there corresponds
a unique measure-preserving set transformation, the shift, defined on the
$\omega$ sets determined by conditions on the random variables of the process.
Sets and random variables invariant relative to this transformation are
called invariant sets and random variables of the process, and the process
is said to be metrically transitive if the shift is. A process is metrically
transitive if and only if the $\tilde{x}_n$ process defined above, the corresponding
process on coordinate space, is metrically transitive.

Let $T$ be a metrically transitive measure-preserving set transformation,
and let $x$ be a random variable. Then the process

$$\{x_n, n \ge 0\}, \qquad x_n = T^n x,$$

is metrically transitive. To see this we note that, if $T_1$ is the shift
corresponding to this stochastic process, $T$ and $T_1$ are identical on the $\omega$ sets
determined by conditions on the $x_j$'s, that is, $T = T_1$ on the domain of
definition of $T_1$. It follows that, if an $\omega$ set is invariant with respect to
$T_1$, it will be invariant with respect to $T$, and therefore that $T_1$ is metrically
transitive, because $T$ is.

An important particular case of the result of the previous paragraph
is the following. Let $\{x_n, -\infty < n < \infty\}$ be a strictly stationary
stochastic process which is metrically transitive; then any measurable
function on the sample space of the $x_j$'s, that is, any random variable
measurable with respect to the field of sets determined by conditions on
the $x_j$'s, defines a strictly stationary stochastic process if the $x_j$'s are
shifted step by step. For example, if the $x_n$ process is metrically transitive,
the processes


{%,2,— co <n< co} {%_ + %nye’, O << < oO} 


are also metrically transitive. The corresponding results hold for a
one-sided parameter range.

If $\{x_n, -\infty < n < \infty\}$ is a process which is strictly stationary, the
process $\{x_{-n}, -\infty < n < \infty\}$ is also strictly stationary, and has the same
invariant random variables as the given process, since the inverse of a
measure-preserving set transformation with an inverse has the same
invariant random variables as the given transformation. The process
$\{x_n, n \ge 0\}$ is also strictly stationary, and has the same invariant random
variables as the given process. This assertion is not obvious and will be
proved in detail. It will follow that the three processes under discussion
here are either simultaneously metrically transitive or none is. Obviously
any random variable invariant with respect to the process $\{x_n, n \ge 0\}$ is
also invariant with respect to the process $\{x_n, -\infty < n < \infty\}$. To prove
the converse we only need to prove that a random variable $y$ which is
invariant with respect to the second process is measurable on the sample
space of the $x_n$'s with $n \ge 0$. To prove this we remark first that, since
the random variable $y$ is measurable on the sample space of the $x_n$'s, to
every $k$ corresponds a random variable $y_k$, measurable with respect to the
field of sets determined by conditions on a certain finite number of $x_n$'s,
for which

$$P\{|y(\omega) - y_k(\omega)| > 1/k\} < 2^{-k}$$

(see II §1). Moreover, if desirable, we can replace $y_k$ by $T^{\nu} y_k$ here,
where $T$ is the shift corresponding to the process $\{x_n, -\infty < n < \infty\}$,
since, using the invariance of $y$ under $T$,

$$P\{|y(\omega) - [T^{\nu} y_k](\omega)| > 1/k\} < 2^{-k}.$$

We, therefore, can even suppose that $y_k$ is measurable with respect to the
sample space of the $x_n$'s with $n \ge 0$. Since

$$\lim_{k \to \infty} y_k = y$$

with probability 1, we conclude that $y$ is also measurable on the sample
space of the $x_n$'s with $n \ge 0$, as was to be proved. A trivial modification
of this reasoning yields the following result, which we state for later
reference. If $\{x_n, -\infty < n < \infty\}$ or $\{x_n, n \ge 0\}$ is a strictly stationary
process, and if $y$ is an invariant random variable, then, for every $n$, $y$ is
measurable on the sample space of $x_n, x_{n+1}, \cdots$.

Example 1 Markov chains Let $[p_{ij}]$ be an $N$-dimensional stochastic
matrix, that is, suppose that
$$p_{ij} \ge 0, \qquad \sum_{j=1}^{N} p_{ij} = 1.$$
Then we have proved in V §2 that there are numbers $p_1, \cdots, p_N$, called
stationary absolute probabilities, satisfying
$$\sum_{i=1}^{N} p_i = 1, \qquad p_i \ge 0, \qquad \sum_{i=1}^{N} p_i p_{ij} = p_j.$$
We shall suppose that the stationary absolute probabilities have been
chosen so that $p_i > 0$ if $i$ is in an ergodic class. Such a choice is always
possible. (For example, in V, Theorem 2.4, take the coefficients of the
linear combination positive for $i \notin F$.) Let $\{x_n, -\infty < n < \infty\}$ be a
Markov chain with the given absolute and transition probabilities, so that
$$P\{x_n(\omega) = i\} = p_i,$$
$$P\{x_{n+1}(\omega) = j \mid x_n(\omega) = i\} = p_{ij}$$
(with probability 1). The $x_n$ process is then a strictly stationary process.
Let $E$ be an ergodic class of states, and define
$$\Lambda = \{x_0(\omega) \in E\}.$$
Since $x_n(\omega) \in E$ implies that $x_{n+1}(\omega) \in E$, and conversely, neglecting $\omega$ sets
of probability 0, it follows that $\Lambda$ is an invariant set. Moreover,
$$P\{\Lambda\} = \sum_{i \in E} p_i > 0.$$
If there is another ergodic class, a second invariant set can be defined in
terms of it, in the same way. This set will also have positive probability,
and will have no points in common with $\Lambda$. Hence in this case
$0 < P\{\Lambda\} < 1$, and we have proved that if there is more than one ergodic
class the process is not metrically transitive. Conversely, if there is only
one ergodic class, the process is metrically transitive. To prove this we
need only prove that every invariant set is of the form $\{x_0(\omega) \in E\}$, and
this is a special case of the following theorem.

THEOREM 1.1 If $\{x_n, n \ge 0\}$ is a stationary Markov process, and if $z$ is
an invariant random variable, then $z$ is measurable on the sample space of $x_0$.

In terms of sets, rather than random variables, this theorem states that
any invariant $\omega$ set is determined by conditions on $x_0$ alone. Of course
$x_0$ can be replaced by any $x_n$. Because of the set version of this theorem,
we can assume, if convenient in proving it, that $z$ is bounded. Actually
we prove the statement by assuming that $z$ is integrable and proving that
$z = E\{z \mid x_0\}$ with probability 1. We have already remarked above that
the invariance of $z$ implies that $z$ is measurable on the sample space of
$x_n, x_{n+1}, \cdots$ for every $n$. Hence, since the $x_n$ process is a Markov
process,
$$E\{z \mid x_0, \cdots, x_n\} = E\{z \mid x_n\}$$
with probability 1, and, by VII, Theorem 4.3, Corollary 1,
$$\lim_{n \to \infty} E\{z \mid x_n\} = z$$
with probability 1. Then
$$\lim_{n \to \infty} P\{|z(\omega) - E\{z \mid x_n\}| > \epsilon\} = 0$$
for every $\epsilon > 0$. Since $z$ is invariant, the probability on the left is actually
independent of $n$, so that, taking $n = 0$, we find that
$$z = E\{z \mid x_0\}$$
with probability 1, as was to be proved.
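
As a small illustration of Example 1, the following sketch builds a chain with two ergodic classes and checks, in exact rational arithmetic, that a mixture of the two class-wise stationary vectors is again a set of stationary absolute probabilities, and that the invariant set attached to the first class has probability strictly between 0 and 1. The transition matrix, the mixing weight, and all function names are hypothetical choices made only for this illustration.

```python
from fractions import Fraction as Fr

# Hypothetical 4-state chain: {0, 1} and {2, 3} are two ergodic classes
# (no transitions between the two blocks).
P = [
    [Fr(1, 2), Fr(1, 2), Fr(0), Fr(0)],
    [Fr(1, 4), Fr(3, 4), Fr(0), Fr(0)],
    [Fr(0), Fr(0), Fr(1, 3), Fr(2, 3)],
    [Fr(0), Fr(0), Fr(1, 2), Fr(1, 2)],
]
E = [0, 1]  # the ergodic class used to define the invariant set {x_0 in E}


def is_stationary(p, P):
    """Check sum_i p_i = 1, p_i >= 0 and sum_i p_i p_ij = p_j, exactly."""
    n = len(P)
    normalized = sum(p) == 1 and all(q >= 0 for q in p)
    invariant = all(sum(p[i] * P[i][j] for i in range(n)) == p[j] for j in range(n))
    return normalized and invariant


def mixture(c):
    """Weight c on the stationary vector of {0, 1}, weight 1 - c on that of {2, 3}."""
    return [c * Fr(1, 3), c * Fr(2, 3), (1 - c) * Fr(3, 7), (1 - c) * Fr(4, 7)]


p = mixture(Fr(2, 5))            # positive on both ergodic classes
prob_invariant_set = sum(p[i] for i in E)

print(is_stationary(p, P), prob_invariant_set)
```

Since the invariant set has probability neither 0 nor 1, a stationary chain started from this `p` is not metrically transitive, in agreement with the example.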

Example 2 Processes with mutually independent random variables.
Suppose that $\{x_n, n \ge 0\}$ is a stochastic process whose random variables
are mutually independent. It is then strictly stationary if and only if the
$x_n$'s have a common distribution. According to the following theorem
such a stationary process is always metrically transitive.

THEOREM 1.2 If $x_0, x_1, \cdots$ are mutually independent random variables
with a common distribution, the strictly stationary process $\{x_n, n \ge 0\}$ is
metrically transitive.

Since an invariant random variable is measurable on the sample space
of $x_n, x_{n+1}, \cdots$ for every $n$, the only invariant random variables are the
constants (with probability 1), by the zero-one law (III, Theorem 1.1), as
was to be proved. This theorem also follows readily from Theorem 1.1.

Example 3 Moving averages Consider the $x_n$ process defined by
$$x_n = \sum_{m=-\infty}^{\infty} c_m y_{m+n},$$
where $\sum_{m=-\infty}^{\infty} |c_m|^2 < \infty$, and the $y_n$'s are mutually independent random
variables with a common distribution function having zero mean and a
finite variance. (The series then converges in the mean and also with
probability 1, by III, Theorem 2.3.) The $y_n$ process is metrically transitive,
according to Theorem 1.2. Let $T$ be the shift corresponding to this
process. Then the $x_n$ process is also metrically transitive, since it is
generated by the metrically transitive measure-preserving set transformation
$T$ applied to $x_0$.

(b) Wide sense stationary processes Wide sense stationary processes
were defined in II §8. The concepts connected with strictly stationary
processes all have their wide sense counterparts. We suppose, as usual,
that there is a probability measure defined on a Borel field of sets of some
space $\Omega$. In the following we shall use some of the elementary Hilbert
space geometry developed in IV §2. Let $\mathfrak{M}$ be a closed linear manifold
of random variables whose squares are integrable. A transformation $U$
operating on the elements of $\mathfrak{M}$, taking them into elements of $\mathfrak{M}$, is called
isometric if the following conditions are satisfied.

IS$_1$ $U$ is single-valued, modulo the random variables which vanish with
probability 1: if $x_1$ is an image of $x$ under $U$, the class of all images of $x$
is the class of random variables equal to $x_1$ with probability 1. In the
following the notation $Ux$ will always mean a particular image of $x$
under $U$.

IS$_2$ $U$ is linear: if $a$ and $b$ are constants, and if $x$, $y$ are random
variables in $\mathfrak{M}$,
$$U(ax + by) = aUx + bUy$$
with probability 1.

IS$_3$ $U$ preserves the root mean square norm:
$$E\{|Ux|^2\} = E\{|x|^2\}, \qquad x \in \mathfrak{M}.$$
It is easily verified that IS$_3$ is true if and only if
$$E\{Ux\,\overline{Uy}\} = E\{x\bar{y}\}, \qquad x, y \in \mathfrak{M}.$$

If every element of $\mathfrak{M}$ is the image of some $x$ under $U$, then $U$ must be
1-1 modulo the random variables which vanish almost everywhere, so
that $U$ has an isometric inverse. In this case $U$ is called unitary.

If $T$ is a measure-preserving set transformation with domain space
a Borel field $\mathcal{F}$, there is a corresponding isometric transformation of
random variables. Let $\mathfrak{M}$ be the closed linear manifold of random
variables measurable with respect to $\mathcal{F}$, with integrable squares. Then
the random variable transformation $T$ with domain limited to $\mathfrak{M}$ is
isometric, and is unitary if $T$ has an inverse.


If $U$ is an isometric transformation, and if $x$ is a random variable in its
domain of definition, then the stochastic process
$$\{x_n, n \ge 0\}, \qquad x_n = U^n x,$$
is stationary in the wide sense. If $U$ is unitary, the corresponding
statement with parameter range $-\infty < n < \infty$ is true.

Conversely let $\{x_n, n \ge 0\}$ be a stochastic process which is stationary
in the wide sense. There is then an isometric transformation which
generates this process in the sense of the preceding paragraph. To see
this define $\mathfrak{M}$ as the closed linear manifold of random variables generated
by the $x_n$'s. Then it is clear that there is one and only one isometric
transformation $U$ with the property that $Ux_n = x_{n+1}$, $n \ge 0$, with
probability 1. This transformation is called the shift. Similarly, if the
parameter range of the process is $-\infty < n < \infty$, we obtain a uniquely
determined unitary transformation taking $x_n$ into $x_{n+1}$, $-\infty < n < \infty$,
also called the shift (relative to the process involved).

We shall now justify considering isometric and unitary transformations 
the wide sense versions of measure-preserving set transformations, and 
measure-preserving set transformations with inverses, respectively, and 
also show how isometric transformations (rather than unitary ones) can 
be dispensed with in this study. 

Let $\{x_n, n \ge 0\}$ or $\{x_n, -\infty < n < \infty\}$ be a stochastic process which
is stationary in the wide sense, and define
$$r(m, n) = E\{x_m \bar{x}_n\}$$
in the second case. In the first case define
$$r(m, n) = E\{x_{m+k}\,\bar{x}_{n+k}\},$$
choosing $k$ so large that $m + k \ge 0$ and $n + k \ge 0$. The precise value
of $k$ used is irrelevant because of the hypothesis of stationarity. Then
obviously
$$r(m, n) = \overline{r(n, m)},$$
and, if $t_1, \cdots, t_N$ are any integers, the matrix $[r(t_i, t_j)]$ is non-negative
definite because
$$\sum_{i,j=1}^{N} r(t_i, t_j)\, a_i \bar{a}_j = E\Big\{\Big|\sum_{j=1}^{N} a_j x_{t_j+k}\Big|^2\Big\} \ge 0,$$
where $k$ is chosen so large that the random variables on the right are
defined. According to II, Theorem 3.1 or 3.2, there is then a Gaussian
process $\{\tilde{x}_n, -\infty < n < \infty\}$ for which
$$E\{\tilde{x}_n\} = 0,$$
$$E\{\tilde{x}_m \bar{\tilde{x}}_n\} = r(m, n),$$
$$E\{\tilde{x}_m \tilde{x}_n\} = 0.$$

This Gaussian process was constructed explicitly in the proof of II,
Theorem 3.1, with $\Omega$ space taken as infinite dimensional coordinate space,
$\tilde{x}_n$ the $n$th coordinate variable, and with the probability measure the
infinite dimensional measure set up by Kolmogorov.

As far as any theorem only involving the covariance function $r(\cdot, \cdot)$ is
concerned, the $\tilde{x}_n$ process can be used instead of the $x_n$ process. For
example, the averages
$$\frac{x_0 + \cdots + x_n}{n+1}$$
converge in the mean if and only if the corresponding $\tilde{x}_n$ averages do.
However, the $\tilde{x}_n$ process is simpler than the $x_n$ process in many ways.
In fact, the $\tilde{x}_n$ process is stationary in both the strict and wide senses, the
shift (considered a measure-preserving set transformation) is induced by
a measure-preserving point transformation, the obvious shift of coordinates,
and the isometric shift is unitary. Thus in proving many theorems
on wide sense stationary processes we can without loss of generality
assume that the isometric shift is unitary, specifically if we are only
interested in wide sense theorems (see II §3).

If $U$ is an isometric transformation, and if $x$ is a random variable on
its manifold of definition, $x$ is called invariant if $Ux = x$ with probability 1.
For example, the random variables vanishing with probability 1 are
invariant. If these are the only invariant random variables, the isometric
transformation will be called metrically transitive in the wide sense. This
terminology will not be used outside this section however, since it is
unorthodox, and is introduced here only to clarify the relation between
strict and wide sense stationarity.

The random variables invariant with respect to the isometric shift of a
wide sense stationary process will be called invariant (wide sense) random
variables of the process. The process will be called (in this section only)
metrically transitive in the wide sense if its isometric shift is metrically
transitive in the wide sense.

If $\{x_n, -\infty < n < \infty\}$ is a process which is stationary in the wide
sense, the process $\{x_{-n}, -\infty < n < \infty\}$ is also stationary in the wide
sense, and has the same invariant random variables as the given process,
since the inverse of a unitary transformation has the same invariant
random variables as the given transformation. The process $\{x_n, n \ge 0\}$
is also stationary in the wide sense, and has the same invariant random
variables as the given process. This assertion is proved by using the
same basic ideas as in the strict sense version given in the first half of
this section, and the proof will therefore be omitted. It follows that the
three processes involved here are either simultaneously metrically
transitive in the wide sense or none is.


The following is the wide sense version of Example 2.

Example 4 Processes with mutually orthogonal random variables.
Suppose that $\{x_n, n \ge 0\}$ is a stochastic process whose random variables
are mutually orthogonal. It is then stationary in the wide sense if and
only if $E\{|x_n|^2\}$ is independent of $n$. According to the following theorem
such a stationary (wide sense) process is always metrically transitive in
the wide sense.

THEOREM 1.3 If $x_0, x_1, \cdots$ are mutually orthogonal random variables,
with $E\{|x_n|^2\}$ independent of $n$, the stationary (wide sense) process $\{x_n, n \ge 0\}$
is metrically transitive in the wide sense.

According to the hypothesis the random variables of the process satisfy
the conditions
$$E\{x_m \bar{x}_n\} = 0, \quad m \ne n,$$
$$E\{x_m \bar{x}_n\} = \sigma^2, \quad m = n,$$
for some $\sigma \ge 0$. Let $\mathfrak{M}$ be the closed linear manifold of random variables
generated by the $x_n$'s. If $\sigma = 0$, every random variable in $\mathfrak{M}$ vanishes
with probability 1, so the process is metrically transitive. If $\sigma > 0$, any
$x$ in $\mathfrak{M}$ can be written as the sum of its Fourier series in the $x_j$'s,
$$x = \sum_{j=0}^{\infty} a_j x_j, \qquad a_j = \frac{1}{\sigma^2} E\{x \bar{x}_j\},$$
where the series converges in the mean. Then, if $x$ is invariant,
$$\sum_{j=0}^{\infty} a_j x_j = \sum_{j=0}^{\infty} a_j x_{j+1}$$
with probability 1. Hence, equating coefficients,
$$0 = a_0 = a_1 = \cdots,$$
so that $x = 0$ with probability 1, as was to be proved.

2. The strong law of large numbers for strictly stationary stochastic processes

The fundamental theorem of measure-preserving transformations is the
ergodic theorem. This is usually stated as follows: Let $S$ be a 1-1
measure-preserving point transformation, and let $x$ be a measurable and
integrable function on the space involved. Then
$$\lim_{n \to \infty} \frac{x(\omega) + x(S\omega) + \cdots + x(S^n\omega)}{n+1}$$
exists and is finite for almost all $\omega$. Since $S^{-1}$ is also a measure-preserving
point transformation, the above result holds also with $S^{-1}$ instead of $S$
(and we shall even show below that the limit is the same, neglecting
values on an $\omega$ set of measure 0). Since measure-preserving set
transformations are slightly more general than measure-preserving point
transformations, the following version of the ergodic theorem is slightly
more general than that stated above: Let $T$ be a measure-preserving set
transformation, and let $x$ be a measurable and integrable function on the
space involved. Then
$$\lim_{n \to \infty} \frac{x + Tx + \cdots + T^n x}{n+1}$$
exists and is finite for almost all $\omega$. The transformation $T$ of this version
of the ergodic theorem corresponds to $S^{-1}$ in the previous version. This
version of the ergodic theorem is the useful one for the purposes of
probability theory, in which the theorem is called the strong law of large
numbers for strictly stationary processes. Although the probability
language makes it possible to give an intuitive description of the limit, the
discussion of §1 shows that otherwise the following statement of the
theorem in probability language differs only verbally from the measure
theoretic statement just given. The $T$ of the measure theoretic statement
becomes the shift transformation of the stochastic process.
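
The point-transformation form of the ergodic theorem can be tried numerically. The sketch below is only an illustration under assumptions not in the text: it takes the space to be the interval [0, 1) with Lebesgue measure, takes $S$ to be rotation by an irrational angle (a 1-1 measure-preserving point transformation which is known to be metrically transitive), and averages the function $\cos^2 2\pi\omega$, whose integral is 1/2. The names `ergodic_average`, `rotate`, and `ALPHA` are hypothetical.

```python
import math


def ergodic_average(x, S, omega, n):
    """Return [x(w) + x(Sw) + ... + x(S^n w)] / (n + 1) for the starting point w = omega."""
    total = 0.0
    point = omega
    for _ in range(n + 1):
        total += x(point)
        point = S(point)
    return total / (n + 1)


ALPHA = math.sqrt(2) - 1  # irrational rotation angle, chosen only for illustration


def rotate(w):
    """Rotation of the unit interval: w -> w + ALPHA (mod 1); preserves Lebesgue measure."""
    return (w + ALPHA) % 1.0


def x(w):
    """Integrable function on [0, 1); its integral over [0, 1) is 1/2."""
    return math.cos(2 * math.pi * w) ** 2


avg = ergodic_average(x, rotate, 0.1, 20000)
print(avg)  # close to the space average 1/2
```

Because the rotation is metrically transitive, the time average approaches the space average of `x`; for a transformation that is not metrically transitive the limit would in general depend on the starting point.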

THEOREM 2.1 Let $\{x_n, n \ge 0\}$ be a strictly stationary stochastic process,
with $E\{|x_0|\} < \infty$, and let $\mathcal{I}$ be the Borel field of invariant $\omega$ sets. Then
$$(2.1) \qquad \lim_{n \to \infty} \frac{x_0 + \cdots + x_n}{n+1} = E\{x_0 \mid \mathcal{I}\}$$
with probability 1. In particular, if the process is metrically transitive, the
right-hand side of (2.1) can be replaced by $E\{x_0\}$.

This theorem is sometimes stated using superficially different averages.
It is clear, for example, that the content of the theorem is not changed if
the average on the left in (2.1) is replaced by
$$\frac{x_k + \cdots + x_{k+n}}{n+1}$$
(fixed $k$) and that the limit is unaltered. If the parameter range of the
process is the set of all integers, we can change the time direction, that is,
replace $x_n$ by $x_{-n}$, and find that the limiting average of $x_{-n}, \cdots, x_0$ when
$n \to \infty$ exists with probability 1. Since the invariant sets of the inverse
of the shift are the same as those of the shift, the limit is the same as that
in (2.1). It then follows that the average
$$\frac{x_{-n} + \cdots + x_n}{2n+1}$$
has the same limit (with probability 1), when $n \to \infty$. However,
$$\lim_{n-m \to \infty} \frac{x_m + \cdots + x_n}{n - m + 1}$$
does not exist with probability 1, in general.

The most important special case of the theorem is that of mutually
independent $x_n$'s. In this case the stationarity hypothesis becomes the
hypothesis that the $x_n$'s have a common distribution function. According
to Theorem 1.2 every process of this type is metrically transitive. Hence
the limit in (2.1) is $E\{x_0\}$ in this case. Two proofs of the strong law of
large numbers in this special case have already been given, in III (Theorem
5.1) and VII (§6).

Before proving Theorem 2.1 we prove three lemmas. 


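LEMMA 2.1 Let $C_0, \cdots, C_n$ be any real numbers, and let $E$ be the
class of integers $m < n$ satisfying
$$(2.2) \qquad C_m < \max_{j > m} C_j.$$
The class $E$ consists of groups of consecutive integers; if $\alpha$, $\beta$ are the first
and last integers in such a group, then
$$(2.3) \qquad C_\alpha < C_{\beta+1}.$$
In fact, since $\beta + 1 \notin E$, $C_{\beta+1} \ge \max_{j > \beta+1} C_j$ (or $\beta + 1 = n$), so that,
since $\beta \in E$,
$$C_\beta < \max_{j > \beta} C_j = C_{\beta+1}.$$
If $\alpha < \beta$, then $\beta - 1 \in E$, and it follows that
$$(2.4) \qquad C_{\beta-1} < \max_{j > \beta-1} C_j = C_{\beta+1},$$
and so on.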
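
Lemma 2.1 is purely combinatorial, so it can be checked directly on a finite sequence. The following sketch computes the class $E$, splits it into groups of consecutive integers, and verifies (2.3) for each group. The sample sequence `C` and the helper names are hypothetical and serve only as an illustration.

```python
def class_E(C):
    """Indices m < n with C[m] < max(C[j] for j > m), where n = len(C) - 1."""
    n = len(C) - 1
    return [m for m in range(n) if C[m] < max(C[m + 1:])]


def groups(indices):
    """Split an increasing list of integers into (first, last) runs of consecutive integers."""
    runs = []
    for m in indices:
        if runs and m == runs[-1][1] + 1:
            runs[-1][1] = m
        else:
            runs.append([m, m])
    return [(a, b) for a, b in runs]


def lemma_holds(C):
    """Check (2.3): C[alpha] < C[beta + 1] for every group (alpha, beta) of the class E."""
    return all(C[a] < C[b + 1] for a, b in groups(class_E(C)))


C = [0, -1, 2, 1, 1, 3, 0, -2, -1, -3]   # arbitrary sample sequence C_0, ..., C_9
E = class_E(C)
runs = groups(E)
print(E, runs, lemma_holds(C))
```

In the application to Lemma 2.2 the numbers $C_j$ are partial sums of a sample sequence, and (2.3) then says that the sum of the terms whose indices fall in one group is positive.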


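LEMMA 2.2 Let $\{x_n, n \ge 0\}$ be a real strictly stationary stochastic
process, with $E\{|x_0|\} < \infty$. Then, if $\beta$ is any constant, if $M$ is an invariant
$\omega$ set of the process, and if $S_n = x_0 + \cdots + x_{n-1}$,
$$(2.5) \qquad \int_{M\{\mathrm{L.U.B.}_{n \ge 1} S_n/n > \beta\}} x_0 \, dP \ge \beta P\Big\{\mathrm{L.U.B.}_{n \ge 1} \frac{S_n(\omega)}{n} > \beta, \; M\Big\}.$$

Replacing the sequence $\{x_n\}$ by $\{x_n - \beta\}$, so that $S_n/n$ is replaced by
$(S_n/n) - \beta$, it is clear that it is sufficient to prove the theorem for $\beta = 0$.
Define $\Lambda$ and $\Lambda_j$ by
$$\Lambda = \{\mathrm{L.U.B.}_{n \ge 1} S_n(\omega) > 0\}, \qquad \Lambda_j = \{\mathrm{L.U.B.}_{1 \le n \le j} S_n(\omega) > 0\},$$
so that $\Lambda_j$ increases to $\Lambda$ when $j \to \infty$. Apply Lemma 2.1 to a sample
sequence of $S_0, \cdots, S_m$ (with $S_0 = 0$), and let $N_j$ be the $\omega$ set where $j$ is in the class of
integers defined by Lemma 2.1,
$$N_j = \{\max_{j \le k \le m-1} [x_j(\omega) + \cdots + x_k(\omega)] > 0\} = T^j \Lambda_{m-j},$$
where $T$ is the shift. Then according to the lemma,
$$\sum_{\omega \in N_j} x_j(\omega) \ge 0,$$
where the notation signifies that the sum is to include $x_j(\omega)$ if $\omega \in N_j$. It
follows that
$$\sum_{j=0}^{m-1} \int_{M N_j} x_j \, dP \ge 0,$$
and therefore, in view of the fact that $T$ is probability-preserving, we
conclude that
$$(2.6) \qquad 0 \le \sum_{j=0}^{m-1} \int_{M\, T^j \Lambda_{m-j}} x_j \, dP = \sum_{j=0}^{m-1} \int_{M \Lambda_{m-j}} x_0 \, dP = \sum_{j=1}^{m} \int_{M \Lambda_j} x_0 \, dP.$$
But
$$(2.7) \qquad \lim_{j \to \infty} \int_{M \Lambda_j} x_0 \, dP = \int_{M \Lambda} x_0 \, dP,$$
so that, dividing (2.6) through by $m$, this inequality yields (2.5) (with
$\beta = 0$) when $m \to \infty$.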
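
The integrals appearing in the proof of Lemma 2.2 can be computed exactly for a simple process. The sketch below makes assumptions that are not in the text: it takes mutually independent $x_n$'s with a hypothetical two-point distribution (values $+2$ and $-1$, so that $E\{x_0\} = 0$), takes $M$ to be the whole space and $\beta = 0$, and enumerates all sample sequences of length $m$ to evaluate the integral of $x_0$ over the set where some partial sum $S_n$, $1 \le n \le m$, is positive.

```python
from fractions import Fraction as Fr
from itertools import product

# Hypothetical i.i.d. (hence strictly stationary) process:
# x_n = +2 with probability 1/3 and x_n = -1 with probability 2/3.
VALUES = {2: Fr(1, 3), -1: Fr(2, 3)}


def integral_over_max_set(m):
    """Exact integral of x_0 over {max_{1<=n<=m} S_n > 0}, with S_n = x_0 + ... + x_{n-1}."""
    total = Fr(0)
    for path in product(VALUES, repeat=m):      # path = (x_0, ..., x_{m-1})
        prob = Fr(1)
        for v in path:
            prob *= VALUES[v]
        partial, best = 0, None
        for v in path:                           # running partial sums S_1, ..., S_m
            partial += v
            best = partial if best is None else max(best, partial)
        if best > 0:
            total += prob * path[0]
    return total


values = [integral_over_max_set(m) for m in range(1, 9)]
print(values)
```

In this example every computed integral is non-negative, which is consistent with (2.5) for $\beta = 0$; the lemma itself concerns the limiting set obtained when $m \to \infty$.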

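LEMMA 2.3 Under the hypotheses of Theorem 2.1, if the $x_n$ process is
real, the random variables
$$\underline{x} = \liminf_{n \to \infty} \frac{S_n}{n}, \qquad \bar{x} = \limsup_{n \to \infty} \frac{S_n}{n}$$
are invariant.

We give the proof only for $\underline{x}$. We must prove that
$$\liminf_{n \to \infty} \frac{x_1 + \cdots + x_n}{n} = \underline{x}$$
with probability 1. Now
$$\liminf_{n \to \infty} \frac{x_1 + \cdots + x_n}{n} = \liminf_{n \to \infty} \Big[\frac{x_0 + \cdots + x_n}{n+1} \cdot \frac{n+1}{n} - \frac{x_0}{n}\Big] = \liminf_{n \to \infty} \frac{x_0 + \cdots + x_n}{n+1} = \underline{x},$$
as was to be proved. Note that as yet we have not proved that $\underline{x}$ and
$\bar{x}$ are finite-valued.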


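We now proceed to the proof of Theorem 2.1. It is sufficient to
prove it in the real case. If $\alpha < \beta$, and if $M_{\alpha,\beta}$ is the invariant $\omega$ set
$\{\underline{x}(\omega) < \alpha < \beta < \bar{x}(\omega)\}$, then evidently
$$M_{\alpha,\beta} = M_{\alpha,\beta}\Big\{\mathrm{L.U.B.}_{n \ge 1} \frac{S_n(\omega)}{n} > \beta\Big\},$$
so that, from Lemma 2.2,
$$(2.8) \qquad \int_{M_{\alpha,\beta}} x_0 \, dP = \int_{M_{\alpha,\beta}\{\mathrm{L.U.B.}_{n \ge 1} S_n/n > \beta\}} x_0 \, dP \ge \beta P\{M_{\alpha,\beta}\}.$$
Applying this result to $\{-x_n\}$, replacing $\alpha$, $\beta$ by $-\beta$, $-\alpha$,
$$(2.9) \qquad \int_{M_{\alpha,\beta}} x_0 \, dP \le \alpha P\{M_{\alpha,\beta}\}.$$
Combining (2.8) and (2.9), we find that $P\{M_{\alpha,\beta}\} = 0$. Since
$$(2.10) \qquad P\{\underline{x}(\omega) < \bar{x}(\omega)\} = P\Big\{\bigcup_{\alpha < \beta} M_{\alpha,\beta}\Big\} \le \sum_{\alpha < \beta} P\{M_{\alpha,\beta}\} = 0 \qquad (\alpha, \beta \text{ rational}),$$
it follows that $\underline{x} = \bar{x}$ with probability 1. There is thus a finite or
infinite limit $x$. An application of Fatou's lemma to the averages
$$\frac{|x_0| + \cdots + |x_n|}{n+1}$$
shows that $x$ is finite-valued with probability 1, and integrable, but this
fact will also be a consequence of the following deeper argument. To
identify the limit $x$ with $E\{x_0 \mid \mathcal{I}\}$ we must prove that $x$ is finite-valued
with probability 1, is integrable, and satisfies
$$(2.11) \qquad \int_\Lambda x \, dP = \int_\Lambda x_0 \, dP$$
for every invariant $\Lambda$. To prove these assertions it is sufficient to show
that the averages
$$\frac{x_0 + \cdots + x_n}{n+1}, \qquad n \ge 0,$$
are uniformly integrable, because then their limit $x$ will be finite-valued
with probability 1, integrable, and, because uniform integrability implies
that integration to the limit is legitimate,
$$\int_\Lambda x_0 \, dP = \frac{1}{n+1} \sum_{j=0}^{n} \int_\Lambda x_j \, dP = \int_\Lambda \frac{1}{n+1} \sum_{j=0}^{n} x_j \, dP \to \int_\Lambda x \, dP \qquad (n \to \infty),$$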


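if $\Lambda$ is invariant. Thus (2.11) is true. We shall show not only that the
ergodic theorem averages are uniformly integrable but that their $\delta$th
powers are uniformly integrable, if $\delta \ge 1$ and if $E\{|x_0|^\delta\} < \infty$. To show
this let $\epsilon$ be a positive number, and choose $\epsilon_1$ so small that
$$\int_M |x_j|^\delta \, dP < \epsilon \quad \text{if } P\{M\} < \epsilon_1, \qquad j \ge 0.$$
Since the $x_j$'s have a common distribution, there is a positive $\epsilon_1$ with
this property. Hence
$$(2.12) \qquad \int_M \Big|\frac{x_0 + \cdots + x_n}{n+1}\Big|^\delta dP \le \int_M \frac{|x_0|^\delta + \cdots + |x_n|^\delta}{n+1} \, dP < \epsilon \quad \text{if } P\{M\} < \epsilon_1,$$
so that the integrands yield uniformly small integrals over sets of small
probability. This property implies uniform integrability if it is known
that the integrands yield uniformly bounded integrals over the whole
space. The latter property follows from the fact that, if $M = \Omega$ in (2.12),
the integral on the right reduces to $E\{|x_0|^\delta\}$. If $\delta > 1$, the uniform
integrability just proved also follows from the following inequality,
obtained by combining Lemma 2.2, with $x_n$ replaced by $|x_n|$, and VII,
Theorem 3.4',
$$E\Big\{\mathrm{L.U.B.}_{n \ge 0} \Big[\sum_{j=0}^{n} \frac{|x_j|}{n+1}\Big]^\delta\Big\} \le \frac{e}{e-1} + \frac{e}{e-1} E\{|x_0| \log^+ |x_0|\}, \qquad \delta = 1,$$
$$E\Big\{\mathrm{L.U.B.}_{n \ge 0} \Big[\sum_{j=0}^{n} \frac{|x_j|}{n+1}\Big]^\delta\Big\} \le \Big(\frac{\delta}{\delta-1}\Big)^\delta E\{|x_0|^\delta\}, \qquad \delta > 1.$$

Finally we remark that, if $E\{|x_0|^\delta\} < \infty$ and if $\delta \ge 1$, then
$$(2.13) \qquad \lim_{n \to \infty} E\Big\{\Big|\frac{x_0 + \cdots + x_n}{n+1} - E\{x_0 \mid \mathcal{I}\}\Big|^\delta\Big\} = 0,$$
because going to the limit under the expectation symbol is now justified
by the uniform integrability of the averages. Thus there is mean
convergence of order $\delta$.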
Theorem 2.1 has the following corollary, which is important in the
harmonic analysis of the sample sequences of strictly stationary processes.

COROLLARY If $\mu$ is real and $-\tfrac12 < \mu \le \tfrac12$,
$$(2.14) \qquad \lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} x_j e^{-2\pi i j \mu} = \hat{x}(\mu)$$
exists with probability 1. The random variable $\hat{x}(\mu)$ has finite expectation,
$$E\{\hat{x}(0)\} = E\{x_0\},$$
$$E\{\hat{x}(\mu)\} = 0, \qquad \mu \ne 0,$$
and is transformed by the shift into $e^{2\pi i \mu}\hat{x}(\mu)$. Unless $\mu$ is in a certain
(at most denumerable) set, $P\{\hat{x}(\mu, \omega) = 0\} = 1$.
If $E\{|x_0|^2\} < \infty$, then $E\{|\hat{x}(\mu)|^2\} < \infty$ also,
$$(2.15) \qquad \mathrm{l.i.m.}_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} x_j e^{-2\pi i j \mu} = \hat{x}(\mu),$$
and
$$E\{\hat{x}(\mu_1)\overline{\hat{x}(\mu_2)}\} = 0, \qquad \mu_1 \ne \mu_2.$$

Note that, if $\mu = 0$, $\hat{x}(\mu)$ becomes the limit in Theorem 2.1. As in the
theorem, the average $\frac{1}{n+1}\sum_{j=0}^{n}$ can be replaced by $\frac{1}{n+1}\sum_{j=k}^{k+n}$, and so on.

To prove the corollary, let $\varphi$ be a random variable which is uniformly
distributed in the interval (0, 1), and which is independent of the $x_n$
process. More precisely, adjoin a space to $\omega$ space, as described in II §2,
to obtain a random variable $\varphi$, uniformly distributed in the interval (0, 1)
(where now all random variables are functions on the enlarged space)
with the property that, if $A$ is a Borel set and if $\Lambda$ is a measurable $\omega$ set,
then
$$P\{\varphi \in A, \; \omega \in \Lambda\} = P\{\varphi \in A\} P\{\omega \in \Lambda\}.$$
For each $\mu$ the stochastic process
$$\{x_n e^{2\pi i(\varphi - n\mu)}, \; n \ge 0\}$$
is strictly stationary, so that, according to Theorem 2.1,
$$\lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} x_j e^{2\pi i(\varphi - j\mu)} = e^{2\pi i \varphi} \lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} x_j e^{-2\pi i j \mu}$$
exists with probability 1. Thus $\hat{x}(\mu)$ is defined, and it is obvious that
under the shift $\hat{x}(\mu)$ becomes $\hat{x}(\mu)e^{2\pi i \mu}$. We have seen in the proof of
the theorem that, if $E\{|x_0|^\delta\} < \infty$ for some $\delta \ge 1$, then the random
variables
$$\Big[\frac{1}{n+1} \sum_{j=0}^{n} |x_j|\Big]^\delta, \qquad n \ge 0,$$
are uniformly integrable (applying the proof to the $|x_j|$'s rather than to
the $x_j$'s). Then the random variables
$$\Big|\frac{1}{n+1} \sum_{j=0}^{n} x_j e^{-2\pi i j \mu}\Big|^\delta, \qquad n \ge 0,$$
are uniformly integrable, and it follows that
$$\lim_{n \to \infty} E\Big\{\Big|\frac{1}{n+1} \sum_{j=0}^{n} x_j e^{-2\pi i j \mu} - \hat{x}(\mu)\Big|^\delta\Big\} = 0,$$
since the uniform integrability justifies going to the limit under the
expectation symbol. Thus in this case there is convergence in the mean
with index $\delta$. In particular when $\delta = 1$ we find that the averages in
(2.14) are uniformly integrable, so that
$$E\{\hat{x}(\mu)\} = \lim_{n \to \infty} E\Big\{\frac{1}{n+1} \sum_{j=0}^{n} x_j e^{-2\pi i j \mu}\Big\} = E\{x_0\} \lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} e^{-2\pi i j \mu},$$
which is $E\{x_0\}$ if $\mu = 0$ and $0$ if $\mu \ne 0$.
[The evaluation for $\mu = 0$ is of course also implied by the evaluation of
$\hat{x}(0)$ given in Theorem 2.1.]
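
The averages in (2.14) can be computed for a single explicit sequence. The sketch below is an illustration under assumptions not in the text: it uses one fixed sample sequence that is a sum of two complex exponentials with hypothetical frequencies and amplitudes (in a stationary process the amplitudes would be random), and it evaluates the averaged sum at a frequency that is present and at one that is absent. The function names are illustrative only.

```python
import cmath

# Hypothetical frequencies -> amplitudes of one fixed sample sequence.
COEFFS = {0.125: 2 + 1j, -0.3: 0.5 + 0j}


def x(k):
    """Sample value x_k = sum over frequencies of amplitude * exp(2*pi*i*k*frequency)."""
    return sum(c * cmath.exp(2j * cmath.pi * k * lam) for lam, c in COEFFS.items())


def harmonic_average(mu, n):
    """(1/(n+1)) * sum_{k=0}^{n} x_k * exp(-2*pi*i*k*mu), the average appearing in (2.14)."""
    total = sum(x(k) * cmath.exp(-2j * cmath.pi * k * mu) for k in range(n + 1))
    return total / (n + 1)


at_frequency = harmonic_average(0.125, 4000)   # approaches the amplitude 2 + 1j
off_frequency = harmonic_average(0.2, 4000)    # approaches 0
print(at_frequency, abs(off_frequency))
```

For this sequence the limit is nonzero only at the two frequencies present, which mirrors the statement that the limit in (2.14) vanishes with probability 1 except for an at most denumerable set of frequencies.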

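If $E\{|x_0|^2\} < \infty$, the $x_n$ process is stationary in the wide sense, and
(2.15) and the mutual orthogonality of the $\hat{x}(\mu)$'s will be proved in §6 by
another method. The proof of (2.15) has already been given (in fact, we
have proved a generalized version with any exponent $\delta \ge 1$). Since the
shift is probability-preserving, it preserves expectations, and we conclude
that, if $E\{|x_0|^2\} < \infty$,
$$E\{\hat{x}(\mu_1)\overline{\hat{x}(\mu_2)}\} = E\{\hat{x}(\mu_1)e^{2\pi i \mu_1}\,\overline{\hat{x}(\mu_2)e^{2\pi i \mu_2}}\} = e^{2\pi i(\mu_1 - \mu_2)} E\{\hat{x}(\mu_1)\overline{\hat{x}(\mu_2)}\} = 0, \qquad \mu_1 \ne \mu_2,$$
so that the $\hat{x}(\mu)$'s are mutually orthogonal. Let $\mu_1, \mu_2, \cdots$ be values of
$\mu$ for which $P\{\hat{x}(\mu_j, \omega) = 0\} < 1$. Then the sequence
$$\Big\{\frac{\hat{x}(\mu_j)}{E^{1/2}\{|\hat{x}(\mu_j)|^2\}}, \; j \ge 1\Big\}$$
is an orthonormal sequence of random variables; $x_0$ can be expanded in
a Fourier series relative to this sequence,
$$x_0 \sim \sum_j a_j \frac{\hat{x}(\mu_j)}{E^{1/2}\{|\hat{x}(\mu_j)|^2\}},$$
where
$$a_j = \frac{E\{x_0 \overline{\hat{x}(\mu_j)}\}}{E^{1/2}\{|\hat{x}(\mu_j)|^2\}}$$
and (Bessel's inequality)
$$(2.16) \qquad E\{|x_0|^2\} \ge \sum_j |a_j|^2.$$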

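Now, applying the shift,
$$E\{x_0 \overline{\hat{x}(\mu_j)}\} = E\{x_1 e^{-2\pi i \mu_j}\overline{\hat{x}(\mu_j)}\} = E\{x_2 e^{-4\pi i \mu_j}\overline{\hat{x}(\mu_j)}\} = \cdots = E\Big\{\frac{1}{n+1} \sum_{m=0}^{n} x_m e^{-2\pi i m \mu_j}\,\overline{\hat{x}(\mu_j)}\Big\} \to E\{|\hat{x}(\mu_j)|^2\} \qquad (n \to \infty).$$
Hence (2.16) becomes
$$E\{|x_0|^2\} \ge \sum_j E\{|\hat{x}(\mu_j)|^2\}.$$
Since the $\mu_j$'s are arbitrary, there is at most an enumerable $\mu$ set for
which $E\{|\hat{x}(\mu)|^2\} > 0$, that is, for which $P\{\hat{x}(\mu, \omega) = 0\} < 1$. This would
finish the proof of the corollary, except that the existence of this $\mu$ set is
stated in the corollary without the hypothesis that $E\{|x_0|^2\} < \infty$. To
remove this restriction we note that, if $x_n^{(N)}(\omega) = x_n(\omega)$ for $|x_n(\omega)| \le N$
and is otherwise 0, the $x_n^{(N)}$ process (fixed $N$) is also strictly stationary,
with $E\{|x_n^{(N)}|^2\} < \infty$. Defining $\hat{x}^{(N)}(\mu)$ using the $x_n^{(N)}$'s instead of the
$x_n$'s we find that, except for an enumerable set $F_N$, $P\{\hat{x}^{(N)}(\mu, \omega) = 0\} = 1$;
the previous result is applicable because $x_n^{(N)}$ is bounded. Moreover
$$E\Big\{\Big|\frac{1}{n+1} \sum_{j=0}^{n} (x_j - x_j^{(N)}) e^{-2\pi i j \mu}\Big|\Big\} \le E\Big\{\frac{1}{n+1} \sum_{j=0}^{n} |x_j - x_j^{(N)}|\Big\} = E\{|x_0 - x_0^{(N)}|\}.$$
Hence (Fatou's lemma, or use the uniform integrability) when $n \to \infty$ we
have
$$E\{|\hat{x}(\mu) - \hat{x}^{(N)}(\mu)|\} \le E\{|x_0 - x_0^{(N)}|\},$$
so that, if $F = \bigcup_N F_N$,
$$E\{|\hat{x}(\mu)|\} \le E\{|x_0 - x_0^{(N)}|\}, \qquad \mu \notin F, \quad N = 1, 2, \cdots.$$
Since the right side goes to 0 when $N \to \infty$, the left side must vanish,
and this finishes the proof.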

Let $T$ be the shift corresponding to the $x_n$ process, and let $z$ be a random
variable on the sample space of the $x_n$'s, with $E\{|z|\} < \infty$. Then,
according to the corollary,
$$(2.14') \qquad \lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} T^j z \, e^{-2\pi i j \mu} = \hat{z}(\mu)$$
exists with probability 1, for all $\mu$. A slight extension of the argument of
the preceding paragraph can be used to show that there is an at most
enumerable $\mu$ set $G$, independent of the choice of $z$, such that
$$P\{\hat{z}(\mu, \omega) = 0\} = 1, \qquad \mu \notin G.$$

A full understanding of these results cannot be obtained without an
understanding of the spectral analysis of unitary operators, for which the
reader is referred to the literature of this subject.


3. The covariance function of a stationary stochastic process; examples

In this and the following sections we shall consider stationary (wide
sense) processes $\{x_n, -\infty < n < \infty\}$, with particular reference to their
harmonic analysis. The translation of the results to the case of the
parameter range $n \ge 0$ will usually be obvious.

In any harmonic analysis it is easier to deal with complex valued
functions, and series
$$\sum_n a_n e^{2\pi i n \lambda},$$
rather than with real functions, and series
$$\sum_n [a_n \cos 2\pi n \lambda + b_n \sin 2\pi n \lambda].$$
For this reason, complex processes will be considered first, and the results
for real processes obtained by specialization.

The covariance function defined by
$$R(n) = E\{x_{m+n}\,\bar{x}_m\}$$
plays a fundamental role in the study of processes stationary in the wide
sense. The following theorem describes the class of covariance functions.

THEOREM 3.1 The covariance function is positive definite, that is,
$$(3.1) \qquad R(-n) = \overline{R(n)}, \qquad \sum_{m,n=1}^{N} R(m-n)\,\alpha_m \bar{\alpha}_n \ge 0, \qquad N = 1, 2, \cdots,$$
for every set of complex numbers $\alpha_1, \cdots, \alpha_N$. Conversely any function
$R$ satisfying (3.1) is the covariance function of a stationary (wide sense)
process, which can be taken as real if the covariance function is real.

In II §3 necessary and sufficient conditions were found that a function
be the covariance function of a process, $r(s, t) = E\{x_s \bar{x}_t\}$. In the present
case these conditions reduce to (3.1). The following theorem describes
the positive definite functions as Fourier transforms.


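THEOREM 3.2 A function $R$ is positive definite if and only if it can be
expressed in the form
$$(3.2) \qquad R(n) = \int_{-1/2}^{1/2} e^{2\pi i n \lambda} \, dF(\lambda),$$
where $F$ (defined for $|\lambda| \le \tfrac12$) is monotone non-decreasing. The function
$F$ is uniquely determined, if suitably normalized, by the equations
$$(3.3) \qquad \frac{F(\lambda_2+) + F(\lambda_2-)}{2} - \frac{F(\lambda_1+) + F(\lambda_1-)}{2} = R(0)(\lambda_2 - \lambda_1) + \lim_{N \to \infty} \sum_{\substack{n=-N \\ n \ne 0}}^{N} R(n) \frac{e^{-2\pi i n \lambda_2} - e^{-2\pi i n \lambda_1}}{-2\pi i n},$$
$$-\tfrac12 < \lambda_1 < \lambda_2 < \tfrac12,$$
$$F(\tfrac12) - F(-\tfrac12) = R(0).$$
If $R$ is real (3.2) can be replaced by
$$(3.2') \qquad R(n) = \int_{0}^{1/2} \cos 2\pi n \lambda \, dG(\lambda),$$
where $G$ (defined for $0 \le \lambda \le \tfrac12$) is monotone non-decreasing and is uniquely
determined, if suitably normalized, by the equations
$$(3.3') \qquad \frac{G(\lambda+) + G(\lambda-)}{2} - G(0) = 2R(0)\lambda + 2\sum_{n=1}^{\infty} R(n) \frac{\sin 2\pi n \lambda}{\pi n}, \qquad 0 < \lambda < \tfrac12,$$
$$G(\tfrac12) - G(0) = R(0).$$

If $R(n)$ is defined by (3.2), $R(-n) = \overline{R(n)}$ and
$$\sum_{m,n=1}^{N} R(m-n)\,\alpha_m \bar{\alpha}_n = \int_{-1/2}^{1/2} \sum_{m,n=1}^{N} \alpha_m \bar{\alpha}_n e^{2\pi i(m-n)\lambda} \, dF(\lambda) = \int_{-1/2}^{1/2} \Big|\sum_{n=1}^{N} \alpha_n e^{2\pi i n \lambda}\Big|^2 dF(\lambda) \ge 0.$$
If $b_n$ is the coefficient of $R(n)$ in (3.3),
$$b_n = \int_{\lambda_1}^{\lambda_2} e^{-2\pi i n \lambda} \, d\lambda = \int_{-1/2}^{1/2} e^{-2\pi i n \lambda} b(\lambda) \, d\lambda,$$
where
$$b(\lambda) = 1, \qquad \lambda_1 < \lambda < \lambda_2,$$
$$b(\lambda) = \tfrac12, \qquad \lambda = \lambda_1, \lambda_2,$$
$$b(\lambda) = 0, \qquad \lambda < \lambda_1 \text{ or } \lambda > \lambda_2,$$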

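except that, if $\lambda_1 = -\tfrac12$, $\lambda_2 < \tfrac12$, we take $b(\tfrac12) = \tfrac12$; if $\lambda_1 > -\tfrac12$, $\lambda_2 = \tfrac12$,
we take $b(-\tfrac12) = \tfrac12$; and if $\lambda_1 = -\tfrac12$, $\lambda_2 = \tfrac12$, we take $b(\lambda) = 1$.

Thus $b_n$ is the $n$th Fourier coefficient of the function $b(\cdot)$. The Fourier
series for $b(\cdot)$ converges to $b(\cdot)$, and the partial sums are uniformly
bounded, if the $N$th partial sum is taken to be $\sum_{-N}^{N}$. Using this fact, we
find that
$$\sum_{n=-N}^{N} b_n R(n) = \int_{-1/2}^{1/2} \sum_{n=-N}^{N} b_n e^{2\pi i n \lambda} \, dF(\lambda) \to \int_{-1/2}^{1/2} b(\lambda) \, dF(\lambda), \qquad N \to \infty,$$
and this yields (3.3). Conversely suppose that $R$ is a positive definite
function. Then, if $\alpha_n = e^{-2\pi i n \lambda}$ in (3.1), we find that
$$f_N(\lambda) = \frac{1}{N} \sum_{m,n=1}^{N} R(m-n) e^{-2\pi i(m-n)\lambda} = \sum_{n=-N}^{N} R(n)\Big(1 - \frac{|n|}{N}\Big) e^{-2\pi i n \lambda} \ge 0.$$
The equations express $f_N(\lambda)$ as the sum of a finite Fourier series, and we
have the usual coefficient formula,
$$(3.4) \qquad R(n)\Big(1 - \frac{|n|}{N}\Big) = \int_{-1/2}^{1/2} e^{2\pi i n \lambda} f_N(\lambda) \, d\lambda = \int_{-1/2}^{1/2} e^{2\pi i n \lambda} \, dF_N(\lambda), \qquad |n| \le N,$$
$$0 = \int_{-1/2}^{1/2} e^{2\pi i n \lambda} \, dF_N(\lambda), \qquad |n| > N, \qquad F_N(\lambda) = \int_{-1/2}^{\lambda} f_N(\mu) \, d\mu.$$
Now $F_N$ is monotone, vanishes at $-\tfrac12$, and has value $R(0)$ at $\lambda = \tfrac12$.
Hence by Helly's theorem there is a convergent subsequence of $\{F_N\}$,
that is, there is a sequence of values of $N$, say $N_1, N_2, \cdots$, for which
$N_j \to \infty$ and $F_{N_j}(\lambda) \to F(\lambda)$ for all $\lambda$. When $N \to \infty$ along this sequence
in (3.4) we obtain (3.2). We have thus proved the theorem in the complex
case. If $F$ is normalized by adjusting it so that
$$F(-\tfrac12) = 0, \qquad F(\lambda+) = F(\lambda), \qquad -\tfrac12 < \lambda < \tfrac12,$$
transferring the jump at $-\tfrac12$, if there is one, to $\tfrac12$, $F$ is uniquely
determined by (3.3). In particular, suppose that $R$ is real. Then $R$ is even
and, since replacing $F(\lambda)$ in (3.2) by $-F(-\lambda)$ replaces $R(n)$ by $R(-n)$
$= R(n)$, the fact that $F$ is uniquely determined if normalized implies that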

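$$F(\lambda_2) - F(\lambda_1) = F(-\lambda_1-) - F(-\lambda_2-), \qquad -\tfrac12 < \lambda_1 < \lambda_2 < \tfrac12.$$
Then, for any version of $F$, if $G$ is defined by
$$G(\lambda) = 2[F(\lambda) - F(0+)] + F(0+) - F(0-), \qquad 0 < \lambda \le \tfrac12,$$
$$G(\lambda) = 0, \qquad \lambda = 0,$$
(3.2') will be true. Conversely, given a $G$ for which (3.2') is true, $F$ can
be found for which (3.2) is true, say
$$F(\lambda) = \tfrac12[G(\lambda) - G(0+)] + G(0+) - G(0), \qquad \lambda \ge 0,$$
$$F(\lambda) = -\tfrac12[G(-\lambda-) - G(0+)], \qquad \lambda < 0.$$
Equation (3.3') then follows from (3.3). If $G$ is normalized, say by
adjusting it so that
$$G(0) = 0, \qquad G(\lambda+) = G(\lambda), \qquad 0 < \lambda < \tfrac12,$$
$G$ is uniquely determined by (3.3').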
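
The converse half of the proof rests on the non-negative trigonometric sums $f_N(\lambda)$ and on the coefficient formula (3.4), both of which can be checked numerically. The sketch below assumes, purely for illustration, the real positive definite function $R(n) = a^{|n|}$ with a hypothetical value $a = 0.6$ (the covariance function of Example 2 below with $R(0) = 1$); the function names and the grid size are likewise assumptions.

```python
import cmath

A = 0.6  # hypothetical parameter; R(n) = A**|n| is positive definite for |A| < 1


def R(n):
    return A ** abs(n)


def f_double(N, lam):
    """(1/N) * sum over m, n = 1..N of R(m - n) * exp(-2*pi*i*(m - n)*lam)."""
    total = 0j
    for m in range(1, N + 1):
        for n in range(1, N + 1):
            total += R(m - n) * cmath.exp(-2j * cmath.pi * (m - n) * lam)
    return total / N


def f_single(N, lam):
    """sum over |n| <= N of R(n) * (1 - |n|/N) * exp(-2*pi*i*n*lam)."""
    return sum(R(n) * (1 - abs(n) / N) * cmath.exp(-2j * cmath.pi * n * lam)
               for n in range(-N, N + 1))


def coefficient(N, n, K=64):
    """Integral of exp(2*pi*i*n*lam) * f_N(lam) over one period by a K-point Riemann sum,
    which is exact up to rounding for this trigonometric polynomial when K > N + |n|."""
    grid = [-0.5 + k / K for k in range(K)]
    return sum(cmath.exp(2j * cmath.pi * n * lam) * f_single(N, lam) for lam in grid) / K


N = 12
print(f_double(N, 0.13), f_single(N, 0.13))      # the two expressions for f_N agree
print(coefficient(N, 3), R(3) * (1 - 3 / N))     # coefficient formula (3.4)
```

The agreement of `f_double` and `f_single` reflects the fact that exactly $N - |k|$ pairs $(m, n)$ satisfy $m - n = k$.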

If $F$ is absolutely continuous, that is, if it is the integral of its derivative,
the process is said to have a spectral density and $F'$ is called the spectral
density function (complex form). If the process is real, $F$ is absolutely
continuous if and only if $G$ is, and in that case $G' = 2F'$ is called the
spectral density function (real form). In any case, $F$ is called the spectral
distribution function of the corresponding process.

If $\sum_{-\infty}^{\infty} |R(n)|$ converges, there is a continuous spectral density function
given by
$$(3.5) \qquad F'(\lambda) = \sum_{n=-\infty}^{\infty} R(n) e^{-2\pi i n \lambda},$$
which reduces to
$$(3.5') \qquad G'(\lambda) = 2R(0) + 4\sum_{n=1}^{\infty} R(n) \cos 2\pi n \lambda$$
in the real case. In fact, this restriction on the covariance function
implies that (3.5) and (3.5') can be integrated to give (3.3) and (3.3').

The spectrum of the process consists of every number $\lambda_0$ in whose
neighborhood $F$ actually increases, in the sense that
$$F(\lambda_0 + \epsilon) > F(\lambda_0 - \epsilon)$$
for every $\epsilon > 0$. The numbers in the spectrum are the frequencies which
enter effectively in the harmonic analysis of $R$ [given by (3.2) and (3.3)]
and also, as will be seen below, in the harmonic analysis of the sample
functions of the process. In particular, if the spectrum only contains a
finite or enumerable sequence of values of $\lambda$, $F$ increases only in jumps,
one at each point of the spectrum.


§3 THE COVARIANCE FUNCTION 477 


Example 1 Suppose that the random variables $\cdots, x_0, x_1, \cdots$ are
mutually orthogonal, with $E\{|x_n|^2\} = \sigma^2$ for all n. Then the $x_n$ process is
stationary in the wide sense, with

$R(n) = 0, \quad n \ne 0; \qquad R(0) = \sigma^2;$

$F'(\lambda) = \sigma^2;$

all frequencies are equally important.
Example 2 Suppose that the $x_n$ process is a Markov process in the
wide sense and is also stationary in the wide sense. Then (V §8)

$R(n) = a^n R(0), \quad n \ge 0, \quad |a| \le 1.$

If $|a| = 1$,

$E\{|x_n - a^n x_0|^2\} = R(0) - \bar a^n R(n) - a^n \overline{R(n)} + R(0) = 0,$

so that

$x_n = a^n x_0, \quad n = 0, \pm 1, \cdots,$

with probability 1, and conversely any sequence of random variables
satisfying these equations obviously defines a Markov process (wide sense)
which is stationary (wide sense). If $|a| < 1$, it is not immediately clear
that R is positive definite. However, assuming for the moment that it is,
there is a spectral density which can be evaluated by (3.5),

$F'(\lambda) = \Big[\sum_{n=0}^{\infty} a^n e^{-2\pi i n\lambda} + \sum_{n=-\infty}^{-1} \bar a^{|n|} e^{-2\pi i n\lambda}\Big] R(0) = \frac{1 - |a|^2}{1 + |a|^2 - 2|a| \cos 2\pi(\lambda - \theta)}\, R(0), \qquad a = |a| e^{2\pi i\theta}.$

Now this function of $\lambda$ is non-negative and integrable, and therefore
defines a spectral density. The corresponding covariance function R
defined by (3.2) is the given one, which is thereby proved legitimate. A
second method of proving legitimacy would be the actual construction of
a process with the desired properties. For example, if $\cdots, \xi_0, \xi_1, \cdots$ is an
orthonormal sequence of random variables and if $x_n$ is defined by

$x_n = \sum_{j=0}^{\infty} a^j \xi_{n-j} \qquad (|a| < 1),$

the $x_n$ process is Markov and stationary, both in the wide sense, with

$R(n) = a^n R(0), \quad n \ge 0.$
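The closed form for the spectral density in Example 2 can be checked numerically against a truncation of the series (3.5). The following is only a sketch under stated assumptions: R(0) is normalized to 1, the value of a is a hypothetical choice with |a| < 1, and the function names are illustrative.

```python
import cmath
import math

def density_series(a, lam, terms=400):
    """Truncated series (3.5) with R(0) = 1, R(n) = a**n, R(-n) = conj(R(n)) for n > 0."""
    total = 1.0
    for n in range(1, terms + 1):
        term = (a ** n) * cmath.exp(-2j * math.pi * n * lam)
        total += 2.0 * term.real  # the n and -n terms are complex conjugates
    return total

def density_closed_form(a, lam):
    """(1 - |a|^2) / (1 + |a|^2 - 2|a| cos 2 pi (lam - theta)), where a = |a| e^{2 pi i theta}."""
    modulus = abs(a)
    theta = cmath.phase(a) / (2.0 * math.pi)
    denominator = 1.0 + modulus ** 2 - 2.0 * modulus * math.cos(2.0 * math.pi * (lam - theta))
    return (1.0 - modulus ** 2) / denominator

a = 0.6 * cmath.exp(2j * math.pi * 0.1)  # hypothetical parameter, |a| < 1
for lam in (-0.4, 0.0, 0.1, 0.3):
    print(lam, density_series(a, lam), density_closed_form(a, lam))
```

The two columns should agree to rounding error, and the density is largest near $\lambda = \theta$, the argument of a divided by $2\pi$.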




Example 3 Let $\xi_1, \cdots, \xi_k$ be mutually orthogonal random variables,
with

$E\{|\xi_j|^2\} = \sigma_j^2 > 0,$

and let $\lambda_1, \cdots, \lambda_k$ be real numbers. Define $x_n$ by

$x_n = \sum_{j=1}^{k} \xi_j e^{2\pi i n\lambda_j}.$

Then

$E\{x_{m+n} \bar x_m\} = \sum_{j=1}^{k} \sigma_j^2 e^{2\pi i n\lambda_j} = R(n)$

is independent of m, so that the $x_n$ process is stationary in the wide sense.
Since the $\lambda_j$'s can be changed by integral amounts without affecting $x_n$,
it is no restriction to assume that $-\tfrac12 < \lambda_j \le \tfrac12$, and we can also assume
that the $\lambda_j$'s are distinct. With these assumptions the spectral distribution
function F of the $x_n$ process increases only at the points $\lambda_1, \cdots, \lambda_k$ (that
is, the spectrum only contains k points) with the jump $\sigma_j^2$ at $\lambda_j$. The
real process corresponding to this complex process is defined as follows.
Let $u_1, \cdots, u_k, v_1, \cdots, v_k$ be mutually orthogonal real random variables,
with

$E\{u_j^2\} = E\{v_j^2\} = \sigma_j^2 > 0,$

and let $\lambda_1, \cdots, \lambda_k$ be any real numbers. Define $x_n$ by

$x_n = \sum_{j=1}^{k} (u_j \cos 2\pi n\lambda_j + v_j \sin 2\pi n\lambda_j).$

Then

$E\{x_{m+n} x_m\} = \sum_{j=1}^{k} \sigma_j^2 \cos 2\pi n\lambda_j = R(n)$

is independent of m, so that the $x_n$ process is stationary in the wide sense.
As in the complex case, it can be assumed that

$-\tfrac12 < \lambda_j \le \tfrac12,$

but we can now confine the $\lambda_j$'s still further, by replacing any negative $\lambda_j$
by $|\lambda_j|$, changing the sign of $v_j$ to compensate. Thus it is no restriction
to assume that

$0 \le \lambda_j \le \tfrac12$

and that the $\lambda_j$'s are distinct. With these assumptions G increases only
at $\lambda_1, \cdots, \lambda_k$, increasing by $\sigma_j^2$ at $\lambda_j$; the (real form) spectrum has k
points in it. The corresponding complex form spectrum may have either
2k or 2k-1 points in it. If $\lambda_j > 0$, F increases by $\sigma_j^2/2$ at $\pm\lambda_j$; if
$\lambda_j = 0$, F increases by $\sigma_j^2$ at 0. By a proper choice of the $\lambda_j$'s and $\xi_j$'s,
the complex version of Example 3 reduces to this real one.
In particular, if the $u_j$'s and $v_j$'s are Gaussian, with

$E\{u_j\} = E\{v_j\} = 0,$

the $x_n$ process is stationary in the strict sense.

It will be proved below that every stationary (wide sense) process is 
either as in Example 3 or can be approximated arbitrarily closely by 
processes of this type. It will then be clear that the harmonic analysis 
of stationary processes must play an essential role in their study. 
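The covariance formula of Example 3 (complex form) can be verified exactly on a small case. The sketch below assumes one particular, hypothetical choice of orthogonal variables, $\xi_j = \sigma_j s_j$ with independent fair signs $s_j = \pm 1$, so that every expectation is a finite average over $2^k$ outcomes; the numerical values are illustrative only.

```python
import cmath
import itertools
import math

def covariance_exact(sigmas, lams, m, n):
    """E{x_{m+n} conj(x_m)} for x_t = sum_j xi_j e^{2 pi i t lam_j},
    with xi_j = sigma_j * s_j and independent fair signs s_j (an assumption)."""
    outcomes = list(itertools.product((1, -1), repeat=len(sigmas)))
    total = 0j
    for signs in outcomes:
        def x(t):
            return sum(s * sig * cmath.exp(2j * math.pi * t * lam)
                       for s, sig, lam in zip(signs, sigmas, lams))
        total += x(m + n) * x(m).conjugate()
    return total / len(outcomes)

def covariance_formula(sigmas, lams, n):
    """R(n) = sum_j sigma_j^2 e^{2 pi i n lam_j}."""
    return sum(sig ** 2 * cmath.exp(2j * math.pi * n * lam)
               for sig, lam in zip(sigmas, lams))

sigmas = [1.0, 0.5, 2.0]   # hypothetical standard deviations
lams = [-0.3, 0.1, 0.45]   # hypothetical distinct frequencies
for m in (0, 5, -7):
    print(m, covariance_exact(sigmas, lams, m, 3), covariance_formula(sigmas, lams, 3))
```

The exact expectation does not depend on m, which is the wide sense stationarity asserted in the example.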

Example 4 Let $\xi$ be any real random variable for which $-\tfrac12 < \xi \le \tfrac12$,
and let a be a constant. Define $x_n$ by

$x_n = a e^{2\pi i n\xi}.$

Then

$E\{x_{m+n} \bar x_m\} = |a|^2 E\{e^{2\pi i n\xi}\} = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dF(\lambda) = R(n),$

where $F/|a|^2$ is the distribution function of $\xi$. The $x_n$ process is stationary
in the wide sense, with spectral distribution function (complex form) F.
This example exhibits a stationary (wide sense) process with any preassigned
spectral distribution function. Note that the sample sequences
are in every case periodic (thinking of n as continuous) whereas in Example
3 this is not true, if k > 1, except in very special cases. Thus the spectral
distribution function does not determine the harmonic analysis of the
individual sample functions except in some sort of average sense. The
real example corresponding to this complex one is the following. Let
y, z be real mutually independent random variables, with $0 \le y \le \tfrac12$ and
z uniformly distributed in the interval from $-\tfrac12$ to $\tfrac12$; let $\sigma$ be a real
constant. Define $x_n$ by

$x_n = \sigma \cos 2\pi(ny + z).$

Then

$E\{x_{m+n} x_m\} = \frac{\sigma^2}{2} E\{\cos 2\pi n y\} = \int_0^{1/2} \cos 2\pi n\lambda\, dG(\lambda) = R(n),$

where $2G/\sigma^2$ is the distribution function of y. The $x_n$ process is stationary
in the wide sense with spectral distribution function (real form) G.

The sample functions in this example are elementary trigonometric 
functions, and what randomness there is in the process is obtained by 
picking values of frequency (y) and phase (z) which determine the sample 
sequence completely. In general, one picks the whole sample sequence
out of a hat, that is, one determines values for $\cdots, x_{-1}, x_0, x_1, \cdots$ in
accordance with their individual and joint distributions. Although it
may seem at first sight that there is something less random about the
process under consideration because a choice of only two sample values
$y(\omega)$, $z(\omega)$ determines the full $x_n$ sample sequence, this impression is false.
In fact in the most general case, picking a sample sequence is equivalent
to picking only a single random variable sample value, simply because
infinite dimensional space can be mapped on one-dimensional space in
such a way that random variables, say the $x_n$'s, become functions of a
single random variable x, so that picking x automatically determines
values for all the $x_n$'s. This point is worth more discussion since it is a
cause of confusion. From our general point of view sample sequence
probabilities are probabilities in $\omega$ space, so that choosing a sample
sequence means choosing a single point in this space, a single choice.
This does not alter the fact that after $\cdots, x_{-1}(\omega), x_0(\omega)$ are all chosen in
accordance with the relevant probability distributions, $x_1(\omega)$ may or may
not be uniquely determined as a function of these chosen values. The
following example is instructive. Let $\omega$ space be the real interval [0, 1],
and let probability measure be Lebesgue measure on this interval. Let
x be the coordinate variable of the interval. Then we can write

$x(\omega) = \omega = .\xi_1 \xi_2 \xi_3 \cdots,$

the decimal expansion of $\omega$. If we define the random variable $x_n$ by

$x_n(\omega) = \xi_n,$

$x_n$ becomes a single-valued function of x, neglecting certain rational
values of $\omega$ which form a set of probability 0. The $x_n$'s are mutually
independent random variables, with

$P\{x_n(\omega) = j\} = \tfrac{1}{10}, \qquad j = 0, \cdots, 9.$

A choice of $x_1(\omega), x_2(\omega), \cdots$ thus requires, from one point of view,
infinitely many mutually independent choices (say by throwing a ten-sided
die). But from another point of view all the $x_n(\omega)$'s are uniquely determined
by a sample value of one random variable x.
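The decimal-digit construction is easy to make concrete. The sketch below is an illustration only: it represents $\omega$ as an exact rational (which forces a terminating expansion, one of the exceptional values the text neglects) and the function name `digit` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

def digit(omega, n):
    """x_n(omega): the n-th decimal digit (n >= 1) of omega in [0, 1)."""
    return int(omega * 10 ** n) % 10

# One choice of omega determines every coordinate of the sample sequence.
omega = Fraction(314159, 10 ** 6)
first_six = [digit(omega, n) for n in range(1, 7)]
print(first_six)

# Each pair (x_1, x_2) corresponds to one interval of length 1/100, so the
# pair is uniform on the 100 possibilities; sampling one point per interval
# shows every pair occurring exactly once.
pair_counts = Counter((digit(Fraction(k, 100), 1), digit(Fraction(k, 100), 2))
                      for k in range(100))
print(len(pair_counts), set(pair_counts.values()))
```

Since $P\{x_1 = i, x_2 = j\} = 1/100 = P\{x_1 = i\} P\{x_2 = j\}$, the first two digits are independent, and the same counting argument applies to any finite set of digits.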
Example 5 Suppose that $x_n$ is defined by

$x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy(\lambda),$

where the $y(\lambda)$ process has orthogonal increments, with

$E\{|dy(\lambda)|^2\} = dF(\lambda).$

Then (IX §2)

$E\{x_{m+n} \bar x_m\} = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dF(\lambda) = R(n)$

is independent of m, so that the $x_n$ process is stationary in the wide sense
with covariance function R and spectral distribution function F. It will
be shown in §4 that every stationary (wide sense) process can be represented
in this way, and that in the real case the representation can also
be put in the form

$x_n = \int_0^{1/2} \cos 2\pi n\lambda\, du(\lambda) + \sin 2\pi n\lambda\, dv(\lambda),$

where the $u(\lambda)$, $v(\lambda)$ processes are real, with orthogonal increments, and

$E\{|du(\lambda)|^2\} = E\{|dv(\lambda)|^2\} = dG(\lambda),$
$E\{du(\lambda)\, dv(\mu)\} = 0.$

Here G is the spectral distribution function (real form) and the last
equation is a symbolic version of the statement that every $u(\lambda)$ increment
is orthogonal to every $v(\lambda)$ increment. Example 3 above is the particular
case in which F is constant except for a finite number of jumps, so that
the spectrum of the process only contains a finite number of points.
This representation of $x_n$ is called the spectral representation (real or
complex form, as the case may be).


4. The spectral representation of a stationary process 

In the following theorem, and in certain later ones involving the 
spectral representation of a stationary process, the parameter set is always 
taken as the set of all integers, rather than, for example, the set of positive 
integers. We omit the extension to the latter case, which would carry us 
too far afield. The theorems are not always true in the latter case without 
certain modifications, but the essential character of the sample sequences 
is the same. 

THEOREM 4.1 Every stationary (wide sense) process $\{x_n, -\infty < n < \infty\}$
has the spectral representation

(4.1)   $x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy(\lambda),$

where the $y(\lambda)$ process has orthogonal increments, and

$E\{|dy(\lambda)|^2\} = dF(\lambda).$

If a $y(\lambda)$ process with orthogonal increments satisfies (4.1), with a suitable
evaluation of the right-hand side, then

(4.2)   $\frac{y(\lambda_2+) + y(\lambda_2-)}{2} - \frac{y(\lambda_1+) + y(\lambda_1-)}{2} = x_0(\lambda_2 - \lambda_1) + \operatorname*{l.i.m.}_{N \to \infty} \sum_{\substack{n=-N \\ n \ne 0}}^{N} x_n\, \frac{e^{-2\pi i n\lambda_2} - e^{-2\pi i n\lambda_1}}{-2\pi i n} \qquad (-\tfrac12 < \lambda_1 < \lambda_2 < \tfrac12),$

$y(\tfrac12) - y(-\tfrac12) = x_0,$

and these equations determine $y(\lambda)$ uniquely, neglecting values on an $\omega$ set
of probability 0, if the $y(\lambda)$ process is properly normalized.
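If the $x_n$ process is real, (4.1) can be put in the form

(4.1')   $x_n = \int_0^{1/2} \cos 2\pi n\lambda\, du(\lambda) + \sin 2\pi n\lambda\, dv(\lambda),$

where the (real) $u(\lambda)$ and $v(\lambda)$ processes have orthogonal increments, with

$E\{[du(\lambda)]^2\} = E\{[dv(\lambda)]^2\} = dG(\lambda), \qquad \lambda > 0,$
$E\{[du(\lambda)]^2\} = dG(\lambda), \qquad \lambda = 0,$
$E\{du(\lambda)\, dv(\mu)\} = 0, \qquad 0 \le \lambda, \mu \le \tfrac12.$

If $u(\lambda)$, $v(\lambda)$ processes with orthogonal increments satisfy these conditions
and (4.1'), with a suitable evaluation of the right-hand side, then

(4.2')   $\frac{u(\lambda+) + u(\lambda-)}{2} - u(0) = 2\lambda x_0 + \operatorname*{l.i.m.}_{N \to \infty} \sum_{\substack{n=-N \\ n \ne 0}}^{N} x_n\, \frac{\sin 2\pi n\lambda}{\pi n} \qquad (0 < \lambda < \tfrac12),$

$u(\tfrac12) - u(0) = x_0,$

$\frac{v(\lambda_2+) + v(\lambda_2-)}{2} - \frac{v(\lambda_1+) + v(\lambda_1-)}{2} = \operatorname*{l.i.m.}_{N \to \infty} \sum_{\substack{n=-N \\ n \ne 0}}^{N} x_n\, \frac{\cos 2\pi n\lambda_1 - \cos 2\pi n\lambda_2}{\pi n} \qquad (0 < \lambda_1 < \lambda_2 < \tfrac12),$

and these equations determine $u(\lambda)$, $v(\lambda)$ uniquely, neglecting values on an
$\omega$ set of probability 0, if the $u(\lambda)$, $v(\lambda)$ processes are properly normalized.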
We have already seen (§3, Example 5) that any $x_n$ process defined by
(4.1) or (4.1') with the stated conditions on the $y(\lambda)$, $u(\lambda)$, $v(\lambda)$ processes
is stationary in the wide sense. Note that (complex form) if $E\{dy(\lambda)\} = 0$,
then $E\{x_n\} = 0$ for all n and [by (4.2)] conversely, and that the $x_n$ process
is Gaussian if and only if the $y(\lambda) - y(0)$ process is Gaussian. Similar
remarks hold for the real form.
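When the spectrum is finite, the inversion formula (4.2) can be tried out on a single sample sequence. The sketch below assumes a hypothetical set of jump values $y_j$ at frequencies $\lambda_j$ (one fixed sample point $\omega$), builds $x_n = \sum_j y_j e^{2\pi i n\lambda_j}$, and truncates the sum in (4.2) at $|n| \le N$; the result should approach the sum of the jumps lying strictly between $\lambda_1$ and $\lambda_2$.

```python
import cmath
import math

def x(n, jumps):
    """Sample value x_n = sum_j y_j e^{2 pi i n lam_j}; `jumps` maps lam_j -> y_j."""
    return sum(y * cmath.exp(2j * math.pi * n * lam) for lam, y in jumps.items())

def increment_from_samples(jumps, lam1, lam2, N):
    """Right-hand side of (4.2), truncated at |n| <= N."""
    total = x(0, jumps) * (lam2 - lam1)
    for n in range(-N, N + 1):
        if n == 0:
            continue
        b = (cmath.exp(-2j * math.pi * n * lam2)
             - cmath.exp(-2j * math.pi * n * lam1)) / (-2j * math.pi * n)
        total += x(n, jumps) * b
    return total

# Hypothetical jumps of y at four frequencies.
jumps = {-0.3: 1 + 2j, 0.05: -0.5j, 0.2: 2.0, 0.4: 1 - 1j}
recovered = increment_from_samples(jumps, -0.1, 0.3, 4000)
print(recovered)  # close to the jumps at 0.05 and 0.2 added together
```

The truncation error decays only like 1/N, and it grows as a frequency approaches $\lambda_1$ or $\lambda_2$, which is consistent with the averaged values $[y(\lambda+) + y(\lambda-)]/2$ appearing in (4.2).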

The proof of the theorem is in part the stochastic analogue of that of
§3 Theorem 3.2, and the notation of that proof will be used. Since the
Fourier series for $b(\cdot)$ converges boundedly to $b(\cdot)$, it also converges in
the mean, with any $\lambda$ weighting. Hence if $x_n$ is defined by (4.1),

$\operatorname*{l.i.m.}_{N \to \infty} \sum_{n=-N}^{N} b_n x_n = \operatorname*{l.i.m.}_{N \to \infty} \int_{-1/2}^{1/2} \Big[\sum_{n=-N}^{N} b_n e^{2\pi i n\lambda}\Big] dy(\lambda) = \int_{-1/2}^{1/2} \operatorname*{l.i.m.}_{N \to \infty} \Big[\sum_{n=-N}^{N} b_n e^{2\pi i n\lambda}\Big] dy(\lambda) = \int_{-1/2}^{1/2} b(\lambda)\, dy(\lambda),$

and this yields (4.2). Here the l.i.m. [ ] under the integral sign is a
mean limit with $\lambda$ weighting $dF(\lambda)$, and the justification for putting the
l.i.m. under the integral sign was given in IX §2. Conversely suppose
that an $x_n$ process is stationary in the wide sense. It is not difficult to
show that the mean limit in (4.2) exists, and this leads to a definition of a
$y(\lambda)$ process satisfying (4.2). This $y(\lambda)$ process yields the spectral
representation (4.1). Rather than giving the details of this approach, which
are surprisingly fussy if the spectral distribution function of the $x_n$ process
is not continuous at $\pm\tfrac12$, we outline a procedure due to Cramér which is
applicable in more general settings. Let $\mathfrak{M}_0$ be the linear manifold of
linear combinations of the $x_n$'s. Define the inner product of two random
variables x, y as $E\{x \bar y\}$, and the distance between them as

$[E\{|x - y|^2\}]^{1/2}.$

Let $\mathfrak{M}$ be the closure of $\mathfrak{M}_0$ with this distance definition. Let $\mathfrak{M}_0'$ be the
linear manifold of linear combinations of the functions $e^{2\pi i n\lambda}$,
$-\infty < n < \infty$, on the interval $[-\tfrac12, \tfrac12]$. Define the inner product of
two functions $\Phi$, $\Psi$ on this interval and the distance between them as

$\int_{-1/2}^{1/2} \Phi(\lambda) \overline{\Psi(\lambda)}\, dF(\lambda), \qquad \Big[\int_{-1/2}^{1/2} |\Phi(\lambda) - \Psi(\lambda)|^2\, dF(\lambda)\Big]^{1/2},$

respectively. Here F is the spectral distribution function of the $x_n$
process. Let $\mathfrak{M}'$ be the closure of $\mathfrak{M}_0'$ with this distance definition.
Then $\mathfrak{M}'$ is the class of functions $\Phi$ on $[-\tfrac12, \tfrac12]$, which are measurable
relative to F measure, with

$\int_{-1/2}^{1/2} |\Phi(\lambda)|^2\, dF(\lambda) < \infty.$

In the following, we consider two random variables identical if they are
equal with probability 1, and two functions on $[-\tfrac12, \tfrac12]$ identical if they
are equal almost everywhere [F measure]. We shall assume that F is
continuous on the right, as we can without restricting the problem. We
shall define a 1-1 correspondence between the elements of $\mathfrak{M}$ and $\mathfrak{M}'$
which preserves inner products, and therefore also preserves distance.
To do this, define $x_n$ as the random variable corresponding to $e^{2\pi i n\lambda}$, and
more generally $\sum_n b_n x_n$ is to correspond to $\sum_n b_n e^{2\pi i n\lambda}$ (finite sums). Since

$E\Big\{\sum_m b_m x_m\, \overline{\sum_n c_n x_n}\Big\} = \sum_{m,n} b_m \bar c_n R(m - n) = \int_{-1/2}^{1/2} \Big(\sum_m b_m e^{2\pi i m\lambda}\Big) \overline{\Big(\sum_n c_n e^{2\pi i n\lambda}\Big)}\, dF(\lambda),$

this correspondence, defined as yet only on $\mathfrak{M}_0$ and $\mathfrak{M}_0'$, is 1-1 and
preserves inner products and distance. The correspondence is then
extended by continuity to $\mathfrak{M}$ and $\mathfrak{M}'$. Let $y(\lambda)$ be the random variable
corresponding to the function which is 1 on the interval $[-\tfrac12, \lambda]$, and
vanishes otherwise. The process $\{y(\lambda), -\tfrac12 \le \lambda \le \tfrac12\}$ will be shown to
yield the representation (4.1). The fact that inner products are preserved
in the correspondence defined above implies that the $y(\lambda)$ process has
orthogonal increments. The fact that distance is preserved implies that
$E\{|dy(\lambda)|^2\} = dF(\lambda)$. For each $\lambda$, the random variable $y(\lambda)$ lies in $\mathfrak{M}$.
Hence any linear combination of the $y(\lambda)$'s lies in $\mathfrak{M}$, and more generally,
according to the definition of stochastic integral in IX §2,

$\int_{-1/2}^{1/2} \Phi(\lambda)\, dy(\lambda) \in \mathfrak{M}, \qquad \text{if } \Phi \in \mathfrak{M}'.$

We shall prove that this stochastic integral is the element of $\mathfrak{M}$
corresponding to the integrand $\Phi$ in $\mathfrak{M}'$. In fact this assertion is true by
definition if $\Phi$ is the characteristic function of an interval $[-\tfrac12, \lambda]$. Since
the class of $\Phi$'s for which it is true form a closed linear manifold,
containing these characteristic functions, it contains $e^{2\pi i n\lambda}$ for every n, by an
elementary approximation, and therefore this class is $\mathfrak{M}'$ itself. Finally,
according to the definition of the correspondence, $x_n$ corresponds to
$e^{2\pi i n\lambda}$, and this fact is now one way of indicating the spectral representation
(4.1).
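The identity that makes the correspondence preserve inner products can be checked directly when F is a pure jump function. In the sketch below the spectrum (frequency mapped to jump of F) and the coefficient sets b, c are hypothetical; the integral with respect to dF then reduces to a finite weighted sum.

```python
import cmath
import math

def R(n, spectrum):
    """R(n) = integral of e^{2 pi i n lam} dF(lam) for a purely discrete F."""
    return sum(jump * cmath.exp(2j * math.pi * n * lam) for lam, jump in spectrum.items())

def inner_random_variables(b, c, spectrum):
    """sum_{m,n} b_m conj(c_n) R(m - n); b and c map index -> coefficient."""
    return sum(bm * cn.conjugate() * R(m - n, spectrum)
               for m, bm in b.items() for n, cn in c.items())

def trig_sum(coeffs, lam):
    return sum(a * cmath.exp(2j * math.pi * k * lam) for k, a in coeffs.items())

def inner_functions(b, c, spectrum):
    """integral of (sum b_m e^{2 pi i m lam}) conj(sum c_n e^{2 pi i n lam}) dF(lam)."""
    return sum(jump * trig_sum(b, lam) * trig_sum(c, lam).conjugate()
               for lam, jump in spectrum.items())

spectrum = {-0.3: 1.0, 0.1: 0.25, 0.45: 4.0}   # hypothetical jumps of F
b = {0: 1.0, 1: 2 - 1j, -2: 0.5j}              # hypothetical coefficients
c = {0: -1.0, 3: 1j}
print(inner_random_variables(b, c, spectrum), inner_functions(b, c, spectrum))
```

Both numbers agree to rounding error, which is the finite-sum case of the isometry that is then extended by continuity.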
If the $y(\lambda)$ process satisfies the condition

$P\{y(-\tfrac12, \omega) = 0\} = 1,$
$P\{y(\lambda+, \omega) = y(\lambda, \omega)\} = 1, \qquad -\tfrac12 \le \lambda < \tfrac12$

(corresponding to the condition that the spectral distribution function
vanish at $-\tfrac12$ and be continuous on the right), $y(\lambda)$ is uniquely determined
up to values on an $\omega$ set of probability 0 by (4.2). If the $y(\lambda)$ process does
not satisfy this condition, $y(\lambda)$ can be replaced by

$y_1(\lambda) = 0, \qquad \lambda = -\tfrac12,$
$y_1(\lambda) = y(\lambda+) - y(-\tfrac12+), \qquad -\tfrac12 < \lambda < \tfrac12,$
$y_1(\lambda) = x_0, \qquad \lambda = \tfrac12,$

to obtain a process satisfying this condition and for which (4.1) remains
true. Because of the ambiguity in the definition of the stochastic integral,
there is equality in (4.1) only for one particular version of the stochastic
integral; for other versions, there is only equality with probability 1.

If the $x_n$ process is real, the equality $x_n = \bar x_n$ combined with (4.1)
implies that, whether the $y(\lambda)$ process is normalized or not,

$P\{y(\lambda_2+, \omega) - y(\lambda_1-, \omega) = \overline{y(-\lambda_1+, \omega) - y(-\lambda_2-, \omega)}\} = 1, \qquad -\tfrac12 < \lambda_1 < \lambda_2 < \tfrac12,$

from which the representation (4.1') is easily derived. The inverse
formula (4.2') is then derived by taking real and imaginary parts in (4.2).
Note that the $dv(\lambda)$ integrand in (4.1') vanishes at 0 and $\tfrac12$ so that, if 0
or $\tfrac12$ is a fixed point of discontinuity of the $v(\lambda)$ process, the discontinuity
makes no contribution to the stochastic integral. If the $u(\lambda)$ and $v(\lambda)$
processes satisfy the conditions

$P\{u(0, \omega) = v(0, \omega) = 0\} = 1,$
$P\{v(\tfrac12, \omega) - v(\tfrac12-, \omega) = v(0+, \omega) - v(0, \omega) = 0\} = 1,$
$P\{u(\lambda+, \omega) = u(\lambda, \omega),\ v(\lambda+, \omega) = v(\lambda, \omega)\} = 1, \qquad 0 < \lambda < \tfrac12,$

then $u(\lambda)$ and $v(\lambda)$ are uniquely determined up to values on an $\omega$ set of
probability 0 by (4.2'), and this normalization can always be effected by
trivial manipulations, as in the complex case.

If the spectrum of the $x_n$ process only contains a finite number of
points, say if F is constant except for jumps of magnitude $\sigma_1^2, \cdots, \sigma_k^2$
at $\lambda_1, \cdots, \lambda_k$, the spectral representation reduces to §3 Example 3,

$x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy(\lambda) = \sum_{j=1}^{k} e^{2\pi i n\lambda_j} y_j,$

where

$y_j = y(\lambda_j+) - y(\lambda_j-).$

The process of Example 3 approximates the general case arbitrarily closely
in the following sense. The integral

$\int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy(\lambda)$

is the limit in the mean, for each n, of the appropriate Riemann-Stieltjes
sums,

$\sum_j e^{2\pi i n\lambda_j} [y(\lambda_{j+1}) - y(\lambda_j)],$

when $\max_j (\lambda_{j+1} - \lambda_j) \to 0$; each sum defines a process of the type of
Example 3 [with $y_j = y(\lambda_{j+1}) - y(\lambda_j)$].

In most practical applications $E\{x_n\}$ is independent of n. In particular
we have seen that, if $E\{x_n\} = 0$ for all n, then $E\{dy(\lambda)\} = 0$ and the $y(\lambda)$
process has uncorrelated as well as orthogonal increments. In general,
if $E\{x_n\} = m$ is independent of n, the process with random variables
$\cdots, x_0 - m, x_1 - m, \cdots$ is also stationary in the wide sense and it is
convenient to use the spectral representation of $x_n - m$ rather than that
of $x_n$. There is actually very little difference between the $y(\lambda)$ processes
involved; we have

$x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy(\lambda), \qquad x_n - m = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy_1(\lambda),$

where $y(\lambda)$ is the same as $y_1(\lambda)$ except for a jump of magnitude m when
$\lambda = 0$,

$y(\lambda) = y_1(\lambda), \qquad \lambda < 0,$
$y(\lambda) = y_1(\lambda) + m, \qquad \lambda \ge 0.$


5. Spectral decompositions 


Let $\{x_n, -\infty < n < \infty\}$ be a stationary (wide sense) process, and
suppose that $\Lambda_1, \cdots, \Lambda_\nu$ are $\nu$ disjunct sets in the interval $[-\tfrac12, \tfrac12]$, whose
union is this interval. It is supposed that these sets are measurable with
respect to the F measure $\int_\Lambda dF(\lambda)$. Then it is possible to exhibit the $x_n$
process as a sum of mutually orthogonal stationary (wide sense) processes
whose spectral distributions are confined to the sets $\Lambda_1, \cdots, \Lambda_\nu$. This
is done as follows: If $x_n$ has the spectral representation

$x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda),$

define $x_n^{(j)}$ by

$x_n^{(j)} = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \Phi_j(\lambda)\, dy(\lambda), \qquad j = 1, \cdots, \nu,$

where $\Phi_j(\lambda)$ is 1 on $\Lambda_j$ and 0 otherwise. Then the $x_n^{(j)}$ process is stationary
(wide sense), and has spectral distribution function given by

$F_j(\lambda) = \int_{-1/2}^{\lambda} \Phi_j(\mu)\, dF(\mu).$

The $F_j$ distribution is thus confined to $\Lambda_j$. Moreover,

$E\{x_n^{(j)} \overline{x_m^{(k)}}\} = \int_{-1/2}^{1/2} e^{2\pi i (n-m)\lambda} \Phi_j(\lambda) \Phi_k(\lambda)\, dF(\lambda) = 0, \qquad j \ne k.$

The above procedure is also applicable if the $\Lambda_j$'s are (denumerably)
infinite in number. Note that, if $E\{x_n\} = 0$ for all n, it follows that
$E\{x_n^{(j)}\} = 0$ for all n also, since then $E\{dy(\lambda)\} = 0$.
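For a process with finite spectrum the decomposition amounts to grouping frequencies. The sketch below is illustrative and rests on assumptions not in the text: the jumps of y are taken as $\sigma_j s_j$ with independent fair signs $s_j$, the spectrum values are hypothetical, and the two sets are the negative and the non-negative frequencies. Expectations are exact averages over all sign patterns.

```python
import cmath
import itertools
import math

# Hypothetical finite spectrum: frequency lam_j -> sigma_j (so F jumps by sigma_j**2).
SPECTRUM = {-0.3: 1.0, -0.1: 0.5, 0.2: 2.0, 0.35: 1.5}

def component(signs, n, subset):
    """x_n^{(j)}: keep only the frequencies lying in `subset` (Phi_j is its indicator)."""
    return sum(s * sig * cmath.exp(2j * math.pi * n * lam)
               for s, (lam, sig) in zip(signs, SPECTRUM.items()) if lam in subset)

def expect(f):
    """Exact expectation over independent fair signs."""
    outcomes = list(itertools.product((1, -1), repeat=len(SPECTRUM)))
    return sum(f(signs) for signs in outcomes) / len(outcomes)

negative = {lam for lam in SPECTRUM if lam < 0}
positive = set(SPECTRUM) - negative

# Components belonging to disjoint sets are orthogonal at all pairs of times.
cross = expect(lambda s: component(s, 3, negative) * component(s, 5, positive).conjugate())
# Each component is stationary, with covariance given by its own part of F.
own = expect(lambda s: component(s, 7, negative) * component(s, 4, negative).conjugate())
expected_own = sum(sig ** 2 * cmath.exp(2j * math.pi * 3 * lam)
                   for lam, sig in SPECTRUM.items() if lam in negative)
print(cross, own, expected_own)
```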

The decomposition we have described here can be effected linearly.
To show this, we show that, if (fixed j) $a_{-N}, \cdots, a_N$ are chosen properly,
the sum $\sum_{k=-N}^{N} a_k x_{n+k}$ will approximate $x_n^{(j)}$ arbitrarily closely in the mean
square sense. Since

$\sum_{k=-N}^{N} a_k x_{n+k} = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \Big(\sum_{k=-N}^{N} a_k e^{2\pi i k\lambda}\Big) dy(\lambda),$

it need only be shown (see IX §2) that, with a proper choice of the $a_k$'s
the sum $\sum_{k=-N}^{N} a_k e^{2\pi i k\lambda}$ will approximate $\Phi_j(\lambda)$ arbitrarily closely in the
mean square sense with $\lambda$ weighting $dF(\lambda)$,

$\int_{-1/2}^{1/2} \Big|\Phi_j(\lambda) - \sum_{k=-N}^{N} a_k e^{2\pi i k\lambda}\Big|^2 dF(\lambda) \to 0.$

This is true if the set $\Lambda_j$ is an interval whose endpoints are not points of
discontinuity of F, because in that case the partial sums of the Fourier
series for $\Phi_j$ converge boundedly to $\Phi_j$, neglecting the endpoints of $\Lambda_j$
which have $F(\lambda)$ measure 0. The class of functions $\Phi$, measurable with
respect to F measure, with

$\int_{-1/2}^{1/2} |\Phi(\lambda)|^2\, dF(\lambda) < \infty,$

contains a subclass of functions which can be approximated in the mean
square sense, with $\lambda$ weighting $dF(\lambda)$, by trigonometric sums. This subclass
is obviously a closed linear manifold, and we have just seen that it
contains the characteristic functions of intervals whose endpoints are not
points of discontinuity of F. The subclass is therefore the whole class,
and includes in particular the characteristic function of $\Lambda_j$, as was to be
proved.

Certain special cases of this decomposition are very important. Let F
be the given spectral distribution function. Then we can write F in the
form

$F = F_1 + F_2 + F_3,$

where $F_1$ is the jump function of F, which increases only at the jumps of
F by the amount of the jumps, $F_2$ is the absolutely continuous component
of F,

$F_2(\lambda) = \int_0^{\lambda} F'(\mu)\, d\mu,$

and the remainder $F_3$, continuous and monotone non-decreasing, is the
continuous singular component of F. The $F_1$ distribution is confined to
the points where F is discontinuous; the $F_2$ distribution is confined to
the continuity points at which $F'$ is finite; the $F_3$ distribution is confined
to the remainder, the set of measure 0 where F is continuous and $F'$
either does not exist or is $+\infty$. Thus the decomposition of F implies a
corresponding decomposition of the $x_n$ process into three mutually
orthogonal processes. The first reduces to a simple sum: If $\lambda_1, \lambda_2, \cdots$
are the discontinuities of F supposed continuous on the right, the random
variables $y(\lambda_j) - y(\lambda_j-)$ form an orthogonal set, and

$x_n^{(1)} = \sum_j e^{2\pi i n\lambda_j} [y(\lambda_j) - y(\lambda_j-)].$

The series converges in the mean (Riesz-Fischer theorem). This is
simply a somewhat extended version of §3, Example 3.

It will be seen in Chapter XII that the problem of least squares linear
prediction involves separating the $x_n^{(2)}$ out of the $x_n$ process.




6. The law of large numbers for stationary (wide sense) processes 
Let $\{x_n, n \ge 0\}$ be a stationary (wide sense) process. Then we shall
prove that

(6.1)   $\operatorname*{l.i.m.}_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} x_j$

exists: this is the law of large numbers for these processes. Before
giving the proof we outline an indirect proof which has educational
(although no other) value. According to II §3 there is a Gaussian $\hat x_n$
process not necessarily defined on the same measure space, for which

$E\{\hat x_n\} = 0,$
$E\{\hat x_m \overline{\hat x_n}\} = E\{x_m \bar x_n\}, \qquad m, n \ge 0,$
$E\{\hat x_m \hat x_n\} = 0.$

The $\hat x_n$ process is strictly stationary and is therefore subject to the (strong)
law of large numbers, Theorem 2.1, according to which

$\lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} \hat x_j$

exists with probability 1. Since the random variables involved are
Gaussian, it is easily verified that the averages cannot converge unless
they also converge in the mean. But then the l.i.m. in (6.1) also exists
because the $x_n$ and $\hat x_n$ processes have the same covariance function, and
only the covariance function is involved in checking the existence of this
l.i.m. This proof is hardly the best from any point of view, but it illustrates
the fact that the theorem in question is the wide sense version of
the strong law of large numbers for stationary processes and illustrates
the close connection between wide sense and strict sense theorems in
general.
THEOREM 6.1 Let $\{x_n, -\infty < n < \infty\}$ be a stationary (wide sense)
process with spectral representation

(6.2)   $x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda).$

Then

(6.3)   $\operatorname*{l.i.m.}_{n-m \to \infty} \frac{1}{n-m+1} \sum_{j=m}^{n} x_j = y(0) - y(0-),$

and

(6.4)   $E\{|y(0) - y(0-)|^2\} = F(0) - F(0-) = \lim_{n-m \to \infty} \frac{1}{n-m+1} \sum_{j=m}^{n} R(j).$

The random variable $y(0) - y(0-)$ is invariant under the shift and is the
projection of $x_0$ on the closed linear manifold $\mathfrak{M}$ of invariant random
variables (wide sense),

$y(0) - y(0-) = \hat E\{x_0 \mid \mathfrak{M}\}.$

Note that the limit will be 0 if, for example, $\lim_{n \to \infty} R(n) = 0$. Except
for the fact that there is more freedom in the averaging in (6.3) than in
(2.1'), Theorem 6.1 is in all detail the wide sense version of Theorem 2.1.
If the $x_n$'s in Theorem 2.1 are mutually independent, corresponding to
the wide sense statement that those in Theorem 6.1 are mutually orthogonal,
exactly corresponding proofs of the strong and weak versions
have been given in VII §6 and IV §7.

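The proof of (6.3) is very simple, since the averages can be evaluated
explicitly:

$\frac{1}{n-m+1} \sum_{j=m}^{n} x_j = \frac{1}{n-m+1} \int_{-1/2}^{1/2} \sum_{j=m}^{n} e^{2\pi i j\lambda}\, dy(\lambda) = \int_{-1/2}^{1/2} \frac{e^{2\pi i m\lambda} (1 - e^{2\pi i (n-m+1)\lambda})}{(n-m+1)(1 - e^{2\pi i\lambda})}\, dy(\lambda),$

and similarly

$\frac{1}{n-m+1} \sum_{j=m}^{n} R(j) = \int_{-1/2}^{1/2} \frac{e^{2\pi i m\lambda} (1 - e^{2\pi i (n-m+1)\lambda})}{(n-m+1)(1 - e^{2\pi i\lambda})}\, dF(\lambda).$

The integrand is $\le 1$ in modulus and converges to $f(\lambda)$ when $n - m \to \infty$,
where

$f(\lambda) = 0, \quad \lambda \ne 0; \qquad f(0) = 1.$

Since the convergence is bounded convergence, it is also convergence in
the mean with $\lambda$ weighting $dF(\lambda)$. Hence (cf. IX §2)

$\operatorname*{l.i.m.}_{n-m \to \infty} \frac{1}{n-m+1} \sum_{j=m}^{n} x_j = \int_{-1/2}^{1/2} f(\lambda)\, dy(\lambda) = y(0) - y(0-).$

Since the convergence is bounded convergence,

$\lim_{n-m \to \infty} \frac{1}{n-m+1} \sum_{j=m}^{n} R(j) = \int_{-1/2}^{1/2} f(\lambda)\, dF(\lambda) = F(0) - F(0-),$

and this finishes the proof of (6.3) and (6.4).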


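The following is the wide sense version of the corollary to Theorem 2.1.
COROLLARY If $-\tfrac12 < \mu \le \tfrac12$,

$\operatorname*{l.i.m.}_{n-m \to \infty} \frac{1}{n-m+1} \sum_{j=m}^{n} x_j e^{-2\pi i j\mu} = y(\mu) - y(\mu-);$

moreover,

$E\{|y(\mu) - y(\mu-)|^2\} = F(\mu) - F(\mu-) = \lim_{n-m \to \infty} \frac{1}{n-m+1} \sum_{j=m}^{n} R(j) e^{-2\pi i j\mu}.$

To prove the corollary we need only apply Theorem 6.1 to the process

$\{x_n e^{-2\pi i n\mu}, -\infty < n < \infty\}.$

This process is stationary in the wide sense, with covariance function
given by $R(n) e^{-2\pi i n\mu}$ and spectral distribution function given by $F(\lambda + \mu)$
(mod 1 in the argument).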
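The covariance half of Theorem 6.1 and its corollary can be observed numerically when F is a pure jump function. The sketch below assumes a hypothetical spectrum (frequency mapped to the size of the jump of F) and evaluates the average of $R(j) e^{-2\pi i j\mu}$ over $m \le j \le n$; it should be close to the jump of F at $\mu$, and close to 0 when $\mu$ is not in the spectrum.

```python
import cmath
import math

def R(k, spectrum):
    """R(k) = sum over the spectrum of jump * e^{2 pi i k lam}."""
    return sum(jump * cmath.exp(2j * math.pi * k * lam) for lam, jump in spectrum.items())

def cesaro_mean(spectrum, m, n, mu=0.0):
    """(1 / (n - m + 1)) * sum_{k=m}^{n} R(k) e^{-2 pi i k mu}."""
    total = sum(R(k, spectrum) * cmath.exp(-2j * math.pi * k * mu) for k in range(m, n + 1))
    return total / (n - m + 1)

spectrum = {0.0: 0.7, 0.25: 1.0, -0.4: 2.0}   # hypothetical jumps of F

print(cesaro_mean(spectrum, 0, 5000))               # near F(0) - F(0-) = 0.7
print(cesaro_mean(spectrum, -2000, 3000, mu=0.25))  # near the jump 1.0 at 0.25
print(cesaro_mean(spectrum, 0, 5000, mu=0.1))       # near 0: no jump at 0.1
```

The second call uses a window that is not anchored at 0, reflecting the freedom in the averaging allowed by (6.3).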

The theorem and its corollary are applicable to stationary (wide sense)
processes whose parameter range is the set of non-negative integers, as
far as the existence of the indicated limits goes, since this existence can
be expressed in terms of the covariance functions of the processes. Moreover,
the limits whose existence is asserted in the corollary are mutually
orthogonal for different values of $\mu$, since this orthogonality can also be
expressed in terms of the covariance functions. However, the evaluation
of the limits in terms of the spectral representation must be sacrificed.

The theorem and corollary show how, by means of a linear operation,
the discontinuities of F and the corresponding component of the $x_n$
process (the $x_n^{(1)}$ process in the notation of §5) can be identified. In
practice, of course, there is no sampling method (based on finite samples)
that can do more than indicate vaguely that any particular $\lambda$ value is a
jump point of F. It is important conceptually to note that in the corollary,
if n = 0, the jumps are evaluated in terms of the past of the process, that
is, in terms of $x_m$ for $m \le 0$. Suppose, for example, that the $x_n$ process
has a spectral distribution confined to a finite or infinite sequence of $\lambda$
values (so that in the notation of §5 the $x_n^{(2)}$ and $x_n^{(3)}$ processes are
absent). Then according to Theorem 6.1 and its corollary the $y(\lambda)$
process [which is made up by summing the jumps $y(\mu) - y(\mu-)$] is
completely determined by the $x_n$'s for $n \le 0$. It follows from the spectral
representation of $x_n$ that $x_n$ itself must be determined uniquely for all n
by the $x_n$'s for $n \le 0$, and determined linearly, that is, each $x_n$ for n > 0
can be approximated arbitrarily closely in the mean by linear combinations
of $x_n$'s with $n \le 0$. That this is not true in general is made clear by
noting that it is not true, for example, if the $x_n$'s are mutually orthogonal.
(These questions will be studied in great detail in XII.) The fact that
this is not true in the general case explains why the $y(\lambda)$ process cannot
always be expressed linearly in terms of the $x_n$'s for $n \le 0$ and why
therefore the sums $\sum_{-N}^{N}$ in (4.2) cannot be modified to be sums $\sum_{0}^{N}$ or $\sum_{-N}^{0}$.
THEOREM 6.2 Under the hypotheses of Theorem 6.1, if $-\tfrac12 \le \mu \le \tfrac12$
then

$\lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} x_j e^{-2\pi i j\mu} = 0$

with probability 1, if there is a positive K and a positive $\alpha$ such that the
following equal expressions are bounded as indicated:

$E\Big\{\Big|\frac{1}{n+1} \sum_{j=0}^{n} x_j e^{-2\pi i j\mu}\Big|^2\Big\} = \frac{1}{n+1} \sum_{j=-n}^{n} \Big(1 - \frac{|j|}{n+1}\Big) R(j) e^{-2\pi i j\mu} = \int_{-1/2}^{1/2} \frac{\sin^2 \pi(n+1)(\lambda - \mu)}{(n+1)^2 \sin^2 \pi(\lambda - \mu)}\, dF(\lambda) \le \frac{K}{n^{\alpha}}.$

The equality of these expressions is easily checked, and the proof will
be omitted. Note that the expressions are bounded [by R(0)] for every
process; the restrictive hypothesis of the theorem lies in the first place
in the assumption that the expressions go to 0 when $n \to \infty$ (which implies
that the mean limit in the preceding corollary is 0), and in the second
place in the assumption that the approach to 0 is as fast as $K/n^{\alpha}$. The
condition of the theorem is satisfied for all $\mu$, with $\alpha = 1$, if $\sum_{0}^{\infty} |R(n)| < \infty$;
it is also satisfied for all $\mu$ if $|R(n)| \le \text{const.}/n^{\alpha}$. In particular, the
condition of the theorem is certainly satisfied for all $\mu$, with $\alpha = 1$, if the
$x_n$'s are mutually orthogonal. The theorem has already been proved in
that special case (IV, Theorem 5.2).
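The equality of the covariance form and the spectral form of the mean square in Theorem 6.2 can be checked directly for a pure jump F. The sketch below uses a hypothetical spectrum; the integral with respect to dF becomes a finite sum of the kernel $\sin^2 \pi(n+1)(\lambda-\mu) / [(n+1)^2 \sin^2 \pi(\lambda-\mu)]$, taken as 1 where $\lambda = \mu$.

```python
import cmath
import math

def R(k, spectrum):
    return sum(jump * cmath.exp(2j * math.pi * k * lam) for lam, jump in spectrum.items())

def mean_square_via_R(spectrum, n, mu):
    """(1/(n+1)) * sum_{|k| <= n} (1 - |k|/(n+1)) R(k) e^{-2 pi i k mu}."""
    total = sum((1 - abs(k) / (n + 1)) * R(k, spectrum) * cmath.exp(-2j * math.pi * k * mu)
                for k in range(-n, n + 1))
    return (total / (n + 1)).real

def mean_square_via_F(spectrum, n, mu):
    """Integral of the kernel against dF, for a purely discrete F."""
    total = 0.0
    for lam, jump in spectrum.items():
        d = math.pi * (lam - mu)
        s = math.sin(d)
        kernel = 1.0 if abs(s) < 1e-15 else (math.sin((n + 1) * d) / ((n + 1) * s)) ** 2
        total += jump * kernel
    return total

spectrum = {0.0: 0.7, 0.25: 1.0, -0.4: 2.0}   # hypothetical jumps of F
for mu in (0.1, 0.25):
    print(mu, mean_square_via_R(spectrum, 50, mu), mean_square_via_F(spectrum, 50, mu))
```

At $\mu = 0.25$, a jump point of this F, the expression stays above the jump size and so cannot tend to 0; at $\mu = 0.1$ it decays like $1/n^2$ for this spectrum.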

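To prove the theorem when $\mu = 0$, choose $\beta$ so large that $\beta\alpha > 1$.
Then, if $n \ge m^{\beta}$,

$E\Big\{\Big|\frac{1}{n+1} \sum_{j=0}^{n} x_j\Big|^2\Big\} \le \frac{K}{m^{\alpha\beta}},$

so that, if $\varepsilon > 0$, and if $n_m$ is the smallest integer $\ge m^{\beta}$,

$\sum_{m=1}^{\infty} P\Big\{\Big|\frac{1}{n_m+1} \sum_{j=0}^{n_m} x_j(\omega)\Big| > \varepsilon\Big\} \le \sum_{m=1}^{\infty} \frac{K}{\varepsilon^2 m^{\alpha\beta}} < \infty.$

Hence

$\Big|\frac{1}{n_m+1} \sum_{j=0}^{n_m} x_j(\omega)\Big| \le \varepsilon$

for sufficiently large m, with probability 1 (Borel-Cantelli lemma), and
since $\varepsilon$ is arbitrary it follows that

(6.5)   $\lim_{m \to \infty} \frac{1}{n_m+1} \sum_{j=0}^{n_m} x_j = 0$

with probability 1. Moreover,

$E\Big\{\max_{n_m \le n < n_{m+1}} \Big|\frac{1}{n+1} \sum_{j=0}^{n} x_j - \frac{1}{n+1} \sum_{j=0}^{n_m} x_j\Big|^2\Big\} \le E\Big\{\Big[\frac{1}{n_m+1} \sum_{j=n_m+1}^{n_{m+1}} |x_j|\Big]^2\Big\} \le R(0)\, \frac{(n_{m+1} - n_m)^2}{(n_m+1)^2} \le \frac{K_1}{m^2}$

for suitable $K_1$. It follows, as in the argument just used, that

(6.6)   $\lim_{n \to \infty} \Big[\frac{1}{n+1} \sum_{j=0}^{n} x_j - \frac{1}{n+1} \sum_{j=0}^{n_m} x_j\Big] = 0 \qquad (n_m \le n < n_{m+1})$

with probability 1, and, since the second factor $1/(n+1)$ in (6.6) can be
replaced by $1/(n_m+1)$ (because their ratio approaches 1 when $n \to \infty$),
(6.5) and (6.6) combine to give the conclusion of the theorem when
$\mu = 0$. The general case can be reduced to this one by the trick used
in the proof of the corollary to Theorem 6.1.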


7. The estimation of R(v) and F(A) from sample sequences 
It is natural to take as an estimate of $R(\nu)$ the average

(7.1)   $\frac{1}{n+1} \sum_{j=0}^{n} x_{j+\nu} \bar x_j.$

The following theorem justifies this estimate, at least for large n. In this
section $\nu$ is fixed and $X_n = x_{n+\nu} \bar x_n$.
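The average (7.1) is simple to compute from a sample sequence. The sketch below applies it to one sample path of the real process of §3 Example 4, $x_j = \sigma \cos 2\pi(jy + z)$, with hypothetical sampled values of y and z; the helper name `estimate_R` is illustrative. For such a path the time average settles near $(\sigma^2/2) \cos 2\pi\nu y$, which depends on the sampled frequency y, so it agrees with $R(\nu) = (\sigma^2/2) E\{\cos 2\pi\nu y\}$ only in special cases.

```python
import math

def estimate_R(x, nu, n):
    """Average (7.1): (1/(n+1)) * sum_{j=0}^{n} x(j + nu) * conj(x(j)).
    `x` is a function of the integer time index; real or complex values both work."""
    return sum(x(j + nu) * x(j).conjugate() for j in range(n + 1)) / (n + 1)

sigma, y, z = 2.0, 0.2, 0.1   # hypothetical amplitude and sampled frequency/phase

def path(j):
    return sigma * math.cos(2.0 * math.pi * (j * y + z))

nu = 3
estimate = estimate_R(path, nu, 20000)
path_limit = (sigma ** 2 / 2.0) * math.cos(2.0 * math.pi * nu * y)
print(estimate, path_limit)
```

This is one way to see why the estimate is consistent only under restrictions on the process, as the following theorem makes precise.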
THEOREM 7.1 Suppose that the $x_n$ and $X_n$ processes are both stationary
in the wide sense, that is, that

$E\{|x_n|^2\} < \infty, \qquad E\{|x_{n+\nu} \bar x_n|^2\} < \infty, \qquad n = 0, \pm 1, \cdots,$

and that

$E\{x_{n+m} \bar x_n\}, \qquad E\{x_{n+m+\nu}\, \bar x_{n+m}\, \bar x_{n+\nu}\, x_n\}$

are independent of n, for all $m = 0, \pm 1, \cdots$. Then

(7.2)   $\operatorname*{l.i.m.}_{n-m \to \infty} \frac{1}{n-m+1} \sum_{j=m}^{n} x_{j+\nu} \bar x_j$

exists. This limit in the mean is $R(\nu)$ (with probability 1) if and only if

(7.3)   $\lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} E\{x_{j+\nu}\, \bar x_j\, \bar x_{\nu}\, x_0\} = |R(\nu)|^2.$

If for some positive K and $\alpha$

(7.4)   $\frac{1}{n+1} \sum_{j=-n}^{n} \Big(1 - \frac{|j|}{n+1}\Big) E\{x_{j+\nu}\, \bar x_j\, \bar x_{\nu}\, x_0\} - |R(\nu)|^2 \le \frac{K}{n^{\alpha}},$

then

(7.2')   $\lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} x_{j+\nu} \bar x_j = R(\nu)$

with probability 1.

If the $x_n$ process is strictly stationary as well as stationary in the wide
sense, the limit in (7.2) with m = 0 also exists as a limit with probability 1.
In particular, this is true if the $x_n$ process is real and Gaussian, with
$E\{x_n\} = 0$ for all n, and in this case (7.2') is true with probability 1 for all
$\nu$ if and only if

(7.5)   $\lim_{n \to \infty} \frac{1}{n+1} \sum_{j=0}^{n} |R(j)|^2 = 0.$

This condition is equivalent to the condition that the spectral distribution
function of the $x_n$ process have no discontinuities.

According to this theorem the estimate (7.1) of $R(\nu)$ [although it always
has the correct expectation $R(\nu)$] is only "consistent" in the usual
statistical sense, that is, asymptotically equal to $R(\nu)$, if specific restrictions
are imposed on the $x_n$ process. The restrictions have the basic effect of
decreasing the influence of the past of the $x_n$ process on the future.

To prove the theorem we note that according to its hypotheses

\[ E\{X_n\} = E\{x_{n+\nu}\bar{x}_n\} = R(\nu) \]

is independent of $n$. Hence the process with variables $\{X_n - R(\nu)\}$ is
stationary in the wide sense and has zero expectations. It now follows
from Theorem 6.1 that

\[ \operatorname*{l.i.m.}_{n-m\to\infty} \frac{1}{n-m+1} \sum_{j=m}^{n} [X_j - R(\nu)] = \operatorname*{l.i.m.}_{n-m\to\infty} \frac{1}{n-m+1} \sum_{j=m}^{n} x_{j+\nu}\bar{x}_j - R(\nu) \]



exists. Moreover, according to Theorem 6.1 [see (6.4)] this limit is 0
with probability 1 if and only if the Cesàro limit at $\infty$ of the covariance
function of the $X_n - R(\nu)$ process is 0. This condition is (7.3). Theorem
6.2 applied to the $X_n - R(\nu)$ process gives condition (7.4) as a sufficient
condition for (7.2'). If the $x_n$ process is strictly stationary, the $X_n$
process is also strictly stationary; hence, since $E\{|X_n|\} \le E\{|x_0|^2\}$, the
strong law of large numbers is applicable to the $X_n$ process. That is,
the limit in (7.2') exists with probability 1. It may or may not be $R(\nu)$;
for example, it is $R(\nu)$, according to Theorem 2.1, if the $x_n$ process is
metrically transitive. In particular, if the $x_n$ process is real and Gaussian,
it is strictly stationary if $E\{x_n\} = 0$ for all $n$, and if it is stationary in the
wide sense. If these conditions are true, (7.2') is true if and only if (7.3)
is true, and in this case the expectations can be evaluated; (7.3) becomes


\[ \lim_{n\to\infty} \frac{1}{n+1} \sum_{j=0}^{n} [R(j)^2 + R(j-\nu)R(j+\nu)] = 0. \tag{7.6} \]


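If this condition is true for $\nu = 0$, (7.5) is satisfied. Conversely, if (7.5)
is satisfied, (7.6) is also satisfied, since by Schwarz's inequality

\[ \left| \frac{1}{n+1} \sum_{j=0}^{n} [R(j)^2 + R(j-\nu)R(j+\nu)] \right|
   \le \frac{1}{n+1} \sum_{j=0}^{n} R(j)^2
     + \frac{1}{n+1} \Big[ \sum_{j=0}^{n} R(j-\nu)^2 \Big]^{1/2} \Big[ \sum_{j=0}^{n} R(j+\nu)^2 \Big]^{1/2}
   \le \frac{2}{n+1} \sum_{j=0}^{n} R(j)^2 + o(1). \]

The above discussion is entirely symmetric in $n$, so that the average in
(7.2') can be replaced by $\frac{1}{n+1}\sum_{j=-n}^{0}$. The final statement of the theorem
is implied by the following theorem: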
THEOREM 7.2 If $\{R(n)\}$ is any covariance sequence, that is, any positive
definite sequence,

\[ R(n) = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dF(\lambda), \tag{7.7} \]

then

\[ \lim_{n\to\infty} \frac{1}{n+1} \sum_{j=0}^{n} |R(j)|^2 = \sum_{\lambda} [F(\lambda+) - F(\lambda-)]^2. \tag{7.8} \]


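In fact,

\[ \frac{1}{n+1} \sum_{j=0}^{n} |R(j)|^2
   = \frac{1}{n+1} \sum_{j=0}^{n} \int_{-1/2}^{1/2}\!\int_{-1/2}^{1/2} e^{2\pi i j(\lambda-\mu)}\, dF(\lambda)\, dF(\mu)
   = \int_{-1/2}^{1/2}\!\int_{-1/2}^{1/2} \frac{1}{n+1} \left[ \frac{e^{2\pi i(n+1)(\lambda-\mu)} - 1}{e^{2\pi i(\lambda-\mu)} - 1} \right] dF(\lambda)\, dF(\mu). \]

Repeating an argument already used, since the integrand converges
boundedly to $f(\lambda, \mu)$, which is 0 except that $f(\lambda, \lambda) = 1$, it follows that

\[ \lim_{n\to\infty} \frac{1}{n+1} \sum_{j=0}^{n} |R(j)|^2
   = \int_{-1/2}^{1/2}\!\int_{-1/2}^{1/2} f(\lambda, \mu)\, dF(\lambda)\, dF(\mu)
   = \int_{-1/2}^{1/2} [F(\mu+) - F(\mu-)]\, dF(\mu)
   = \sum_{\lambda} [F(\lambda+) - F(\lambda-)]^2. \]

The sum on the right is of course effectively an enumerable sum, since
$F(\lambda+) - F(\lambda-) > 0$ on at most an enumerable $\lambda$ set.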
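
The limit relation between the averaged squared covariances and the squared spectral jumps can be checked numerically for a spectral distribution that consists of jumps only. The sketch below is an illustration under assumptions: the jump locations and sizes in `jumps` are made up, and the helper names are hypothetical.

```python
import cmath


def covariance_from_jumps(n, jumps):
    """R(n) = sum_k p_k * exp(2*pi*i*n*lam_k) for a pure-jump spectral distribution."""
    return sum(p * cmath.exp(2j * cmath.pi * n * lam) for lam, p in jumps)


def cesaro_mean_square(n, jumps):
    """(1/(n+1)) * sum_{j=0}^{n} |R(j)|^2."""
    total = sum(abs(covariance_from_jumps(j, jumps)) ** 2 for j in range(n + 1))
    return total / (n + 1)


# Hypothetical (location, jump size) pairs; the jump sizes sum to R(0) = 1.
jumps = [(0.10, 0.5), (0.25, 0.3), (0.40, 0.2)]
limit = sum(p * p for _, p in jumps)       # sum of squared jumps
approx = cesaro_mean_square(2000, jumps)   # averaged |R(j)|^2 up to n = 2000
```

For these values the sum of squared jumps is 0.38, and the average over 2001 terms differs from it only by the contribution of the cross terms, which decays like 1/n.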

Now consider the problem of estimating the spectral distribution of a 
stationary process from sample sequences. One way is to estimate a part 
of the covariance function and to use this to evaluate the spectral distri- 
bution function. The following theorem shows the possibility of the 
direct approach. 

THEOREM 7.3 Suppose that the $x_n$ process is stationary (wide sense) and
that for every $\nu$

\[ \lim_{n\to\infty} \frac{1}{n+1} \sum_{j=0}^{n} x_{j+\nu}\bar{x}_j = R(\nu) \tag{7.9} \]

with probability 1. Then, if $\lambda_1$ and $\lambda_2$ are continuity points of the spectral
distribution function $F$,

\[ \lim_{n\to\infty} \int_{\lambda_1}^{\lambda_2} \frac{1}{n+1} \Big| \sum_{j=0}^{n} x_j e^{-2\pi i j\lambda} \Big|^2 d\lambda = F(\lambda_2) - F(\lambda_1) \tag{7.10} \]

with probability 1. The limit is uniform in any closed interval of continuity
of $F$, with probability 1. If the limit in (7.9) exists only as a limit in
the mean (or even only a limit in probability), the limit equation (7.10) will
still be true as a limit in probability.
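To prove this define $\Phi_n$ by

\[ \Phi_n(\lambda) = \int_{-1/2}^{\lambda} \frac{1}{n+1} \Big| \sum_{j=0}^{n} x_j e^{-2\pi i j\mu} \Big|^2 d\mu. \]

Then, if $|\nu| \le n$ (written here for $\nu \ge 0$),

\[ \int_{-1/2}^{1/2} e^{2\pi i\nu\lambda}\, d\Phi_n(\lambda) = \frac{1}{n+1} \sum_{j=0}^{n-\nu} x_{j+\nu}\bar{x}_j. \]

Hence, from (7.9),

\[ \lim_{n\to\infty} \int_{-1/2}^{1/2} e^{2\pi i\nu\lambda}\, d\Phi_n(\lambda)
   = \lim_{n\to\infty} \frac{1}{n+1} \sum_{j=0}^{n-\nu} x_{j+\nu}\bar{x}_j
   = \lim_{n\to\infty} \frac{1}{n-\nu+1} \sum_{j=0}^{n-\nu} x_{j+\nu}\bar{x}_j
   = R(\nu) = \int_{-1/2}^{1/2} e^{2\pi i\nu\lambda}\, dF(\lambda), \]
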

with probability 1. Thus the Fourier-Stieltjes coefficients of $\Phi_n$ converge
to those of $F$. It follows (Lévy continuity theorem for monotone
functions defined in closed finite intervals) that $\Phi_n \to F + \text{const.}$ at all
the continuity points of $F$, for almost every sample sequence. This
proves the first part of the theorem. If only p lim is supposed in (7.9),
p lim can be derived in (7.10) by going to convergent subsequences.

Note that it is incorrect to deduce from (7.10) that

\[ \frac{1}{n+1} \Big| \sum_{j=0}^{n} x_j e^{-2\pi i j\lambda} \Big|^2 \]

can be used to approximate the spectral density $F'(\lambda)$. In fact, as a
counterexample suppose that the $x_j$'s are real and Gaussian, and mutually
independent, with $E\{x_j\} = 0$, $E\{x_j^2\} = 1$. In this case

\[ R(0) = 1, \qquad R(n) = 0, \quad n \neq 0, \qquad F'(\lambda) \equiv 1. \]

The approximant $\Phi_n'(0)$ becomes the square of a real Gaussian random
variable with mean 0 and variance 1 which of course cannot converge to
$F'(0) = 1$ when $n \to \infty$. A similar argument takes care of the other
values of $\lambda$. A somewhat more elegant complex counterexample is the
following: Let the $x_j$'s be mutually independent, with 0 means, each with
mutually independent Gaussian real and imaginary parts having expectations
0 and variances 1/2, so that $E\{|x_j|^2\} = 1$. Then $F'(\lambda) \equiv 1$, and the
random variables

\[ x_0,\; x_1 e^{-2\pi i\lambda},\; \cdots,\; x_n e^{-2\pi i n\lambda} \]

have these same properties, so that $\Phi_n'(\lambda)$ is the square of the absolute
value of a complex Gaussian random variable with mean 0, whose real
and imaginary parts are mutually independent and have variance 1/2.
When $n \to \infty$, $\Phi_n'(\lambda)$ cannot possibly converge to $F'(\lambda) = 1$, for any $\lambda$.
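
The real Gaussian counterexample can be illustrated by simulation. The following is a sketch under assumptions, not part of the argument: it draws independent standard Gaussian variables with Python's `random` module, evaluates the derivative of the integrated approximant at λ = 0 as the squared sum divided by n + 1, and records the spread over repeated trials. The seed, sample sizes, and function names are arbitrary choices.

```python
import random


def periodogram_at_zero(n, rng):
    """(1/(n+1)) * (x_0 + ... + x_n)^2 for independent real standard Gaussian x_j."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n + 1))
    return total * total / (n + 1)


def mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((u - m) ** 2 for u in values) / (len(values) - 1)
    return m, v


rng = random.Random(0)   # fixed seed so that repeated runs agree
results = {
    n: mean_and_variance([periodogram_at_zero(n, rng) for _ in range(2000)])
    for n in (10, 100)
}
```

In runs of this kind the sample mean stays near 1 while the sample variance does not shrink as n grows, which matches the statement that the square of a standard Gaussian variable cannot converge to the constant 1.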


8. Absolutely continuous spectral distributions and moving averages 
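The spectral representation

\[ x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda), \tag{8.1} \]

can be modified advantageously if $F$ is absolutely continuous. In that
case, if $f$ is a Baire function satisfying

\[ |f(\lambda)|^2 = F'(\lambda), \]

there is a $\tilde{y}(\lambda)$ process with orthogonal increments which satisfies

\[ x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} f(\lambda)\, d\tilde{y}(\lambda), \qquad E\{|d\tilde{y}(\lambda)|^2\} = d\lambda. \tag{8.2} \]

If $F'(\lambda)$ never vanishes, (8.2) is true with

\[ \tilde{y}(\lambda) = \int_{-1/2}^{\lambda} \frac{1}{f(\mu)}\, dy(\mu). \]

If $F'(\lambda)$ may vanish, the proof needs a bit more care. Let $\{y_1(\lambda), -1/2 \le \lambda \le 1/2\}$
be a stochastic process with orthogonal increments, satisfying

\[ E\{|dy_1(\lambda)|^2\} = d\lambda, \qquad E\{[y_1(\lambda) - y_1(\mu)]\bar{x}_n\} = 0 \quad (\text{all } \lambda, \mu, n). \]

It may be necessary to enlarge $\omega$ space by adjunction, as explained in
II §2, to obtain such a $y_1(\lambda)$ process. Then, if $\Phi(\lambda) = 1$ when $F'(\lambda) = 0$,
and $\Phi(\lambda) = 0$ otherwise, we can define $\tilde{y}(\lambda)$ to satisfy (8.2) by

\[ \tilde{y}(\lambda) = \int_{-1/2}^{\lambda} \frac{1}{f(\mu)}\, dy(\mu) + \int_{-1/2}^{\lambda} \Phi(\mu)\, dy_1(\mu) \]

[where we take $1/f(\lambda)$ as 0 when $f(\lambda) = 0$]. In all cases, if $E\{x_n\} = 0$ for
all $n$ we can choose $\tilde{y}(\lambda)$ in such a way that $E\{d\tilde{y}(\lambda)\} = 0$.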
As an application of this spectral representation we consider processes
of moving averages, that is, processes defined by series of the form

\[ x_n = \sum_{j=-\infty}^{\infty} c_j \xi_{n+j}, \tag{8.3} \]

where $\cdots, \xi_0, \xi_1, \cdots$ are mutually orthogonal random variables, with

\[ E\{\xi_m \bar{\xi}_n\} = \delta_{mn}, \]

so that the $\xi_n$'s form an orthonormal set. It is supposed that
$\sum_{j=-\infty}^{\infty} |c_j|^2 < \infty$, and this implies that the series in (8.3) converges in the
mean (Riesz-Fischer theorem). We have

\[ E\{x_{m+n}\bar{x}_m\} = \sum_{j=-\infty}^{\infty} c_j \bar{c}_{j+n} = R(n). \]

Since the left side is thus independent of $m$, the $x_n$ process is stationary
in the wide sense. If the $\xi_n$ process is strictly stationary, for example if
the $\xi_n$'s are mutually independent, with a common distribution function,
the $x_n$ process will also be strictly stationary. We shall now prove that
a stationary (wide sense) process is a process of moving averages if and
only if its spectral distribution is absolutely continuous. In the first place,
if the $x_n$ process is a process of moving averages, given by (8.3), define
$c(\lambda)$ by

\[ c(\lambda) = \sum_{j=-\infty}^{\infty} c_j e^{2\pi i j\lambda}, \]

where the series converges in the mean. Then, using the Parseval identity,

\[ R(n) = \sum_{j=-\infty}^{\infty} c_j \bar{c}_{j+n}
        = \int_{-1/2}^{1/2} c(\lambda)\, \overline{c(\lambda)}\, e^{2\pi i n\lambda}\, d\lambda
        = \int_{-1/2}^{1/2} |c(\lambda)|^2 e^{2\pi i n\lambda}\, d\lambda, \]

so that the spectral distribution function is absolutely continuous, with
$F'(\lambda) = |c(\lambda)|^2$.
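
The Parseval step can be checked numerically for a moving average with finitely many coefficients. The sketch below makes its own assumptions explicit: it takes the convention x_n = Σ c_j ξ_{n+j} with c_j = 0 outside 0 ≤ j < len(c), defines c(λ) = Σ c_j e^{2πijλ}, and evaluates the integral of e^{2πinλ}|c(λ)|² by an equally spaced Riemann sum, which is exact for trigonometric polynomials of degree below the grid size. The coefficient values and function names are hypothetical.

```python
import cmath


def ma_covariance(c, n):
    """R(n) = sum_j c_j * conj(c_{j+n}) for finitely many coefficients c_0, c_1, ..."""
    n_abs = abs(n)
    s = sum((c[j] * complex(c[j + n_abs]).conjugate()
             for j in range(len(c) - n_abs)), 0j)
    # R(-n) is the complex conjugate of R(n).
    return s if n >= 0 else s.conjugate()


def transfer(c, lam):
    """c(lambda) = sum_j c_j * exp(2*pi*i*j*lambda)."""
    return sum(cj * cmath.exp(2j * cmath.pi * j * lam) for j, cj in enumerate(c))


def covariance_from_density(c, n, grid=64):
    """Riemann sum for the integral of exp(2*pi*i*n*lambda) * |c(lambda)|^2 over [-1/2, 1/2)."""
    pts = [-0.5 + k / grid for k in range(grid)]
    return sum(cmath.exp(2j * cmath.pi * n * lam) * abs(transfer(c, lam)) ** 2
               for lam in pts) / grid


coeffs = [1.0, 0.5, 0.25j]   # hypothetical moving-average coefficients
pairs = [(ma_covariance(coeffs, n), covariance_from_density(coeffs, n))
         for n in range(-3, 4)]
```

Each pair in `pairs` holds the covariance computed from the coefficients and from the spectral density, and the two agree up to rounding; the covariance vanishes once |n| exceeds the number of coefficients minus one.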
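Conversely, if $F$ is an absolutely continuous spectral distribution
function of some $x_n$ process, we use the spectral representation (8.2) and
insert in it the Fourier series of $f$, which converges in the mean:

\[ f(\lambda) = \sum_{j=-\infty}^{\infty} \gamma_j e^{2\pi i j\lambda}, \]

obtaining

\[ x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \sum_{j=-\infty}^{\infty} \gamma_j e^{2\pi i j\lambda}\, d\tilde{y}(\lambda)
       = \sum_{j=-\infty}^{\infty} \gamma_j \xi_{n+j}, \qquad \xi_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, d\tilde{y}(\lambda). \]

Thus $x_n$ has the representation (8.3); the orthonormality of the $\xi_n$'s is
verified by inspection.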

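It is of some interest to evaluate explicitly the $y(\lambda)$ in the spectral
representation (8.1) for a process given by (8.3). This is easily done, and
we have

\[ y(\lambda_2) - y(\lambda_1) = \sum_{n=-\infty}^{\infty} \Big( \int_{\lambda_1}^{\lambda_2} c(\lambda) e^{-2\pi i n\lambda}\, d\lambda \Big) \xi_n, \]

where $c(\lambda)$ was defined above.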
9. Linear operations on stationary processes 


Let $\{x_n, -\infty < n < \infty\}$ be a stationary (wide sense) process with
spectral representation

\[ x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda). \tag{9.1} \]
—1/2 


By a linear operation on the process we mean a transformation taking
$x_n$ into $\hat{x}_n$, where $\hat{x}_n$ is either a finite sum of the form

\[ \hat{x}_n = \sum_{j} c_j x_{n+j} \]

or is a limit in the mean of such sums. The $\hat{x}_n$ process is then stationary
(wide sense). Since

\[ \sum_{j} c_j x_{n+j} = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \Big( \sum_{j} c_j e^{2\pi i j\lambda} \Big)\, dy(\lambda) \]

and since convergence in the mean for stochastic integrals corresponds to
convergence in the mean of the integrands [$\lambda$ weighting $dF(\lambda)$], it is clear
that the most general $\hat{x}_n$ process is given by

\[ \hat{x}_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} c(\lambda)\, dy(\lambda), \qquad \int_{-1/2}^{1/2} |c(\lambda)|^2\, dF(\lambda) < \infty, \tag{9.2} \]

where $c(\cdot)$ is measurable with respect to $F$ measure. The covariance
function of the $\hat{x}_n$ process is given by

\[ \hat{R}(n) = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} |c(\lambda)|^2\, dF(\lambda), \]

so that the spectral distribution function is given by

\[ \hat{F}(\lambda) = \int_{-1/2}^{\lambda} |c(\mu)|^2\, dF(\mu). \]

In other words spectral intensities are multiplied by the factor $|c(\lambda)|^2$.




The processes of moving averages discussed in §8 are a special case;
they are obtained by performing linear operations on an $x_n$ process whose
variables are orthogonal to each other.

As an example of an application of a linear operation consider the result
of smoothing data by the use of moving averages. To be definite suppose
that the smoothing is accomplished by averaging three consecutive $x_j$'s,

\[ \hat{x}_n = \tfrac{1}{3}(x_{n-1} + x_n + x_{n+1}). \]

Then $c(\lambda) = \tfrac{1}{3}(e^{-2\pi i\lambda} + 1 + e^{2\pi i\lambda})$ so that spectral intensities are multiplied
by the factor

\[ \tfrac{1}{9}\, |e^{-2\pi i\lambda} + 1 + e^{2\pi i\lambda}|^2 = \frac{(1 + 2\cos 2\pi\lambda)^2}{9}. \]

This means that frequencies near $\lambda = 1/3$ become relatively less important.
Any such averaging smooths the data at the cost of changing the frequency
relations of the process.
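
The factor by which three-term averaging multiplies spectral intensities can be tabulated directly. The sketch below is an illustration with hypothetical function names: it evaluates the response of the average (x_{n-1} + x_n + x_{n+1})/3 at a frequency λ and compares its squared modulus with the closed form (1 + 2 cos 2πλ)²/9.

```python
import cmath
import math


def smoothing_response(lam):
    """c(lambda) for the three-term average (x_{n-1} + x_n + x_{n+1}) / 3."""
    z = cmath.exp(2j * math.pi * lam)
    return (1 / z + 1 + z) / 3


def gain(lam):
    """Closed form |c(lambda)|^2 = (1 + 2*cos(2*pi*lambda))^2 / 9."""
    return (1 + 2 * math.cos(2 * math.pi * lam)) ** 2 / 9


# Frequencies at which the two expressions are compared.
table = [(lam, abs(smoothing_response(lam)) ** 2, gain(lam))
         for lam in (0.0, 0.1, 0.25, 1 / 3, 0.5)]
```

The factor equals 1 at λ = 0, vanishes at λ = 1/3, and equals 1/9 at λ = 1/2, so the smoothing leaves slow oscillations almost untouched and suppresses those with period near three.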


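10. Rational spectral densities (in $e^{2\pi i\lambda}$)

In this section we shall treat stationary (wide sense) processes with
absolutely continuous spectral distributions having spectral densities
$F'(\lambda) \not\equiv 0$ which are rational functions of $e^{2\pi i\lambda}$:

\[ F'(\lambda) = c\, \frac{\prod_{j=1}^{\alpha'} (e^{2\pi i\lambda} - w_j')}{\prod_{j=1}^{\beta'} (e^{2\pi i\lambda} - z_j')}, \qquad c \neq 0, \quad w_j', z_j' \neq 0, \]

where no $w_j'$ is a $z_k'$. This expression for $F'$ can be put in a simpler
form as follows. In the first place no $z_j'$ can have modulus 1, since $F'$
is integrable, and any $w_j'$ of modulus 1 must appear an even number of
times since $F' \ge 0$. In the second place, since $F'$ is real,

\[ c\, \frac{\prod_{j=1}^{\alpha'} (e^{2\pi i\lambda} - w_j')}{\prod_{j=1}^{\beta'} (e^{2\pi i\lambda} - z_j')}
   = \bar{c}\, \frac{\prod_{j=1}^{\alpha'} (e^{-2\pi i\lambda} - \bar{w}_j')}{\prod_{j=1}^{\beta'} (e^{-2\pi i\lambda} - \bar{z}_j')}
   = \bar{c}\, e^{2\pi i(\beta' - \alpha')\lambda}\,
     \frac{\prod_{j=1}^{\alpha'} \bar{w}_j' \prod_{j=1}^{\alpha'} (1/\bar{w}_j' - e^{2\pi i\lambda})}
          {\prod_{j=1}^{\beta'} \bar{z}_j' \prod_{j=1}^{\beta'} (1/\bar{z}_j' - e^{2\pi i\lambda})}. \]
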


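Hence to each $w_j'$ corresponds a $w_k' = 1/\bar{w}_j'$, and to each $z_j'$ corresponds
a $z_k' = 1/\bar{z}_j'$. Since, if $\zeta \neq 0$,

\[ |e^{2\pi i\lambda} - \zeta| = |\zeta|\, |e^{2\pi i\lambda} - 1/\bar{\zeta}|, \]

we can write $F'(\lambda)$, which is non-negative, as the absolute value of any of
the above expressions, and obtain

\[ F'(\lambda) = c\, \frac{\big| \prod_{j=1}^{\alpha} (e^{2\pi i\lambda} - w_j) \big|^2}{\big| \prod_{j=1}^{\beta} (e^{2\pi i\lambda} - z_j) \big|^2},
   \qquad c > 0, \quad 0 < |w_j| \le 1, \quad 0 < |z_j| < 1. \]

Here the $w_j$'s are the $w_j'$'s of modulus < 1 and also those of modulus 1
counted half as many times as they appear in the $w_j'$'s; the $z_j$'s are the
$z_j'$'s of modulus < 1. We thus can write, finally,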


\[ F'(\lambda) = \frac{\big| \sum_{j=0}^{\alpha} A_j e^{2\pi i j\lambda} \big|^2}{\big| \sum_{j=0}^{\beta} B_j e^{2\pi i j\lambda} \big|^2}, \qquad A_0 A_\alpha B_0 B_\beta \neq 0, \tag{10.1} \]

where the roots of $\sum_0^\beta B_j z^j$ have modulus < 1 and those of $\sum_0^\alpha A_j z^j$ have
modulus ≤ 1, and the two polynomials have no common factors.

In particular, if the $x_n$ process is real, $F'$ is even. It follows that to
every $w_j$ corresponds a $w_k = \bar{w}_j$, and to every $z_j$ corresponds a $z_k = \bar{z}_j$.
Hence the $z_j$'s and $w_j$'s are conjugate imaginary in pairs, and the $A_j$'s
and $B_j$'s are real, or can be made so by multiplying the numerator and
denominator polynomials by a suitable constant, say $|A_0|/A_0$.

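According to §8 the spectral representation of a stationary (wide sense)
process of the type considered here can be written in the form

\[ x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, \frac{\sum_{0}^{\alpha} A_j e^{2\pi i j\lambda}}{\sum_{0}^{\beta} B_j e^{2\pi i j\lambda}}\, d\tilde{y}(\lambda), \qquad E\{|d\tilde{y}(\lambda)|^2\} = d\lambda, \tag{10.2} \]

where the $\tilde{y}(\lambda)$ process has orthogonal increments. The covariance
function is

\[ R(n) = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, \frac{\big| \sum_{0}^{\alpha} A_j e^{2\pi i j\lambda} \big|^2}{\big| \sum_{0}^{\beta} B_j e^{2\pi i j\lambda} \big|^2}\, d\lambda. \tag{10.3} \]
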


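Since this integral can be written in the form

\[ R(n) = \frac{1}{2\pi i} \oint_{|z|=1} z^{n+\beta-\alpha-1}\,
   \frac{\sum_{0}^{\alpha} A_j z^j\, \sum_{0}^{\alpha} \bar{A}_j z^{\alpha-j}}{\sum_{0}^{\beta} B_j z^j\, \sum_{0}^{\beta} \bar{B}_j z^{\beta-j}}\, dz, \tag{10.3'} \]

$R(n)$ is the $n$th coefficient in the Laurent expansion in powers of $z$ on
$|z| = 1$ of a rational function. It follows that $R(n)$ decreases exponentially
when $|n| \to \infty$ (because the Laurent series converges in an annulus
containing $|z| = 1$). The value of $R(n)$ can be obtained by residues as
follows. Suppose, as above, that the roots of $\sum_0^\beta B_j z^j$ have modulus < 1.
Then, if these roots are distinct, say $z_1, \cdots, z_\beta$, residue theory gives the
evaluation

\[ R(n) = \sum_{j=1}^{\beta} c_j z_j^{\,n}, \qquad n \ge \alpha - \beta + 1. \]

(The roots of $\sum_0^\beta \bar{B}_j z^{\beta-j}$ will necessarily have modulus > 1; they are
$1/\bar{z}_1, \cdots, 1/\bar{z}_\beta$.) If the process is real, each non-real $z_j$ can be paired
with a $z_k = \bar{z}_j$, and $c_k = \bar{c}_j$. If some of the $z_j$'s are multiple roots, the
coefficients $c_j$ will become polynomials in $n$. This form of $R$ shows that
$R$ satisfies a linear difference equation with constant coefficients, and in
fact

\[ \sum_{k=0}^{\beta} B_k R(n+k) = \frac{1}{2\pi i} \oint_{|z|=1} z^{n+\beta-\alpha-1}\,
   \frac{\sum_{0}^{\alpha} A_j z^j\, \sum_{0}^{\alpha} \bar{A}_j z^{\alpha-j}}{\sum_{0}^{\beta} \bar{B}_j z^{\beta-j}}\, dz
   = \begin{cases} 0, & n + \beta - \alpha - 1 \ge 0, \\ A_0 \bar{A}_\alpha / \bar{B}_\beta, & n + \beta - \alpha - 1 = -1, \end{cases} \tag{10.4} \]
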


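since the quotient in the integrand has no poles when $|z| \le 1$, and has
value $A_0 \bar{A}_\alpha / \bar{B}_\beta$ at $z = 0$. Define $\xi_{n+\beta}$ by

\[ \sum_{j=0}^{\beta} B_j x_{n+j} = \xi_{n+\beta}. \tag{10.5} \]

Then

\[ \xi_{n+\beta} = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \sum_{0}^{\alpha} A_j e^{2\pi i j\lambda}\, d\tilde{y}(\lambda). \]

Using (10.4) and this expression for $\xi_{n+\beta}$, we find

\[ E\{\xi_n \bar{x}_m\} = \begin{cases} 0, & n - m \ge \alpha + 1, \\ A_0 \bar{A}_\alpha / \bar{B}_\beta, & n - m = \alpha, \end{cases} \tag{10.6} \]

\[ E\{\xi_n \bar{\xi}_m\} \begin{cases} = 0, & |n - m| \ge \alpha + 1, \\ \neq 0, & |n - m| = \alpha. \end{cases} \tag{10.7} \]
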
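Case 1 $\beta = 0$, $B_0 = 1$. In this case,

\[ R(n) = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \Big| \sum_{0}^{\alpha} A_j e^{2\pi i j\lambda} \Big|^2 d\lambda = \sum_{j} A_j \bar{A}_{j+n} \]

(with $A_j$ taken as 0 for $j$ outside $0, \cdots, \alpha$) and

\[ x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \sum_{0}^{\alpha} A_j e^{2\pi i j\lambda}\, d\tilde{y}(\lambda) = \sum_{j=0}^{\alpha} A_j \zeta_{n+j}, \]

where

\[ \zeta_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, d\tilde{y}(\lambda), \]

so that the sequence $\{\zeta_n\}$ is an orthonormal sequence. Thus in this case
(see §8) the $x_n$ process is a process of moving averages, involving averages
of $\alpha + 1$ successive terms. Note that $R(n) = 0$ for $|n| > \alpha$. Conversely,
if this condition on $R$ is satisfied, $F$ is absolutely continuous, with spectral
density the square of the absolute value of a polynomial in $e^{2\pi i\lambda}$ of order
≤ $\alpha$, by (3.5).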

THEOREM 10.1 If a stationary (wide sense) $x_n$ process satisfies the
difference equation (10.5), where $B_0, \cdots, B_\beta$ are arbitrary except that
$B_0 B_\beta \neq 0$ and $\sum_0^\beta B_j z^j$ has no zero of modulus 1, where the first line of (10.7)
is true, and $E\{|\xi_n|^2\} > 0$, then the spectral distribution of the $x_n$ process is
absolutely continuous, with derivative given by (10.1), for some $A_0, \cdots, A_\alpha$.
The finite poles of $\sum_0^\alpha A_j z^j / \sum_0^\beta B_j z^j$ have modulus < 1 if and only if the first
line of (10.6) is true.

Under the stated hypotheses, the $\xi_n$ process is stationary in the wide
sense, under Case 1 discussed above. If the $x_n$ process has spectral
distribution function $F$, and if (10.5) is true, the spectral distribution
function of the process defined by the left side of (10.5) is (cf. §9) given by

\[ \int_{-1/2}^{\lambda} \Big| \sum_{0}^{\beta} B_j e^{2\pi i j\mu} \Big|^2 dF(\mu). \]

Thus, equating the two evaluations of the $\xi_n$ process spectral distribution
that we have obtained,

\[ \int_{-1/2}^{\lambda} \Big| \sum_{0}^{\beta} B_j e^{2\pi i j\mu} \Big|^2 dF(\mu) = \int_{-1/2}^{\lambda} \Big| \sum_{0}^{\alpha} A_j e^{2\pi i j\mu} \Big|^2 d\mu. \]

Hence $F$ is absolutely continuous, and $F'$ is given by (10.1). If the zeros
of $\sum_0^\beta B_j z^j$ have modulus < 1, we have proved that (10.6) is true. The
same proof shows that both lines of (10.6) are true if the finite poles of
the fraction specified in the statement of the theorem have modulus < 1.
Conversely, if the first line of (10.6) is true,

\[ 0 = \overline{E\{\xi_n \bar{x}_m\}} = \frac{1}{2\pi i} \oint_{|z|=1} z^{m-n+\beta-1} \left[ \frac{\big| \sum_{0}^{\alpha} A_j z^j \big|^2}{\sum_{0}^{\beta} B_j z^j} \right] dz, \qquad n - m \ge \alpha + 1. \]

Hence the Laurent expansion of the bracketed quotient on $|z| = 1$
contains no powers of $z$ higher than $\alpha - \beta$, in any event only a finite
number of positive powers. It follows that the bracketed quotient has
no finite poles for $|z| \ge 1$, as was to be proved.

THEOREM 10.2 If a stationary (wide sense) $x_n$ process has an absolutely
continuous spectral distribution, with density given by (10.1), and if the
zeros of $\sum_0^\beta B_j z^j$ have modulus < 1, then $R$ satisfies the difference equation

\[ \sum_{k=0}^{\beta} B_k R(n+k) = \begin{cases} 0, & n \ge -\beta + \alpha + 1, \\ A_0 \bar{A}_\alpha / \bar{B}_\beta \neq 0, & n = -\beta + \alpha. \end{cases} \tag{10.4'} \]

Conversely, if the covariance function of a stationary (wide sense) process
satisfies this difference equation, aside from the specification of the value
in the second line of (10.4'), for some constants $B_0, \cdots, B_\beta$ with $B_0 B_\beta \neq 0$,
and some integer $\alpha \ge 0$, then, if $\xi_{n+\beta}$ is defined by (10.5), the first lines of
(10.6) and (10.7) are true, and

\[ E\{\xi_n \bar{x}_{n-\alpha}\} \neq 0, \qquad E\{|\xi_n|^2\} \neq 0. \]

We have already proved the direct half of this theorem. The verification
of the converse is trivial. Note that Theorem 10.1 is available to
complete the characterization of the $x_n$ process under the hypotheses of
the converse.

In the preceding theorems, we have avoided the case when the difference
equation (10.5) is satisfied by $\xi_n \equiv 0$. This case is not really relevant,
because here the spectrum of the $x_n$ process only contains finitely many
points, that is, the spectral distribution function is constant except for a
finite number of jumps.

Case 2 $\alpha = 0$, $A_0 = 1$. In this case, the $\xi_n$ sequence defined above
is an orthonormal sequence, by (10.7). If $E\{\xi_n \bar{x}_m\} = 0$ for $n > m$, that
is, according to Theorem 10.1, if the roots of $\sum_0^\beta B_j z^j$ have modulus < 1,
the sum on the left in (10.5) is orthogonal to all the random variables
$\cdots, x_{n+\beta-2}, x_{n+\beta-1}$. Hence

\[ -\frac{B_{\beta-1} x_{n+\beta-1} + \cdots + B_0 x_n}{B_\beta} \]

is the projection of $x_{n+\beta}$ on the closed linear manifold of random variables
generated by all previous $x_j$'s. In particular, if $\beta = 1$, the process is a
Markov process in the wide sense. Thus Case 2 can be described as the
case covering those processes for which the projection of any $x_n$ on its
past involves only a finite segment of the past, and in which this projection
is not $x_n$ itself. The last proviso is to exclude the case $\xi_n \equiv 0$. If this
projection is considered the linear mean square prediction of $x_n$ in terms
of the past, the prediction error is $\xi_n / B_\beta$, and the mean square prediction
error is

\[ E\{|\xi_n / B_\beta|^2\} = 1 / |B_\beta|^2. \]

In particular, if the $x_n$ process is real and Gaussian, and if $E\{x_n\} = 0$ for
all $n$, the process is strictly stationary, the above projections become
conditional expectations, and, if $\beta = 1$, the process becomes a Markov
process; if $\beta > 1$ the process is a multiple Markov process.
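
The simplest instance of Case 2 is a real first order scheme, and its covariance function can be checked against the difference equation by direct arithmetic. The sketch below rests on assumptions chosen for illustration only: β = 1, B_1 = 1, B_0 = -a with a hypothetical real a of modulus less than 1, so that x_{n+1} - a x_n is the orthonormal innovation and the covariance function is R(n) = a^{|n|}/(1 - a²). The function names are not taken from the text.

```python
def ar1_covariance(a, n):
    """R(n) = a^|n| / (1 - a^2) for the real scheme x_{n+1} - a*x_n = xi_{n+1}, |a| < 1."""
    return a ** abs(n) / (1 - a * a)


def difference_equation(a, n):
    """B_0*R(n) + B_1*R(n+1) with the assumed coefficients B_0 = -a, B_1 = 1."""
    return -a * ar1_covariance(a, n) + ar1_covariance(a, n + 1)


def prediction_mse(a):
    """E{|x_{n+1} - a*x_n|^2} = (1 + a^2)*R(0) - 2*a*R(1)."""
    return (1 + a * a) * ar1_covariance(a, 0) - 2 * a * ar1_covariance(a, 1)


a = 0.6   # hypothetical coefficient, the single root of z - a
residuals = [difference_equation(a, n) for n in range(0, 6)]
boundary_value = difference_equation(a, -1)
mse = prediction_mse(a)
```

For n ≥ 0 the combination vanishes, at n = -1 it equals 1, and the mean square error of predicting x_{n+1} by a x_n is 1, in agreement with the value 1/|B_β|² when B_β = 1.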

Finally, we remark that, if $\alpha = 0$ in the hypothesis of Theorem 10.1,
that is, if $\alpha = 0$ in (10.7), then, as the proof of the theorem shows, the
hypothesis that $\sum_0^\beta B_j z^j$ has no zero of modulus 1 is unnecessary.


CHAPTER XI 


Stationary Processes— 
Continuous Parameter 


1. Generalities; metric transitivity


(a) Strictly stationary processes We adopt our usual basic hypotheses:
there is a probability measure $P\{\cdot\}$ defined on a Borel field of $\omega$ sets.
Here $\omega$ is a point of some space $\Omega$. A family $\{T_t, -\infty < t < \infty\}$
[$\{T_t, 0 \le t < \infty\}$] of transformations taking points of $\Omega$ into points of
$\Omega$ will be called a translation group [semi-group] of measure-preserving 1-1
point transformations if each $T_t$ is a 1-1 measure-preserving point
transformation as defined in X §1, and if

\[ T_{s+t} = T_s T_t \tag{1.1} \]

identically in the indicated parameters. The transformation $T_0$ will
necessarily be the identity. In the group case $T_{-t}$ will be the inverse
of $T_t$, and the family $\{T_{-t}, -\infty < t < \infty\}$ will also be a translation
group of measure-preserving 1-1 point transformations.

If $x$ is a random variable, and if $\{T_t, -\infty < t < \infty\}$ is a translation
group of measure-preserving 1-1 point transformations, the stochastic
process

\[ \{T_t x, -\infty < t < \infty\} \]

is strictly stationary. The corresponding result holds in the semi-group
case.

A family $\{T_t, -\infty < t < \infty\}$ [$\{T_t, 0 \le t < \infty\}$] of measure-preserving
set transformations will be called a translation group [semi-group] of
measure-preserving set transformations if (1.1) is true modulo the sets of
probability 0, that is, if for every measurable $\omega$ set $\Lambda$, and every choice
of $T_s T_t \Lambda$, the latter choice is one of the images of $\Lambda$ under $T_{s+t}$. The
transformation $T_0$ will be the identity in the sense that every image of a
measurable set $\Lambda$ under $T_0$ will differ from $\Lambda$ by at most a set of probability
0. In the group case $T_{-t}$ is the inverse of $T_t$ and the family
$\{T_{-t}, -\infty < t < \infty\}$ is also a translation group of measure-preserving
set transformations.

If $x$ is a random variable, and if $\{T_t, -\infty < t < \infty\}$ is a translation
group of measure-preserving set transformations, the stochastic process

\[ \{T_t x, -\infty < t < \infty\} \]

is strictly stationary, no matter which version of $T_t x$ is taken for each $t$.
The corresponding result holds in the semi-group case.

Each translation group [semi-group] of measure-preserving 1-1 point
transformations induces a translation group [semi-group] of
measure-preserving set transformations, in the obvious way (see the analogous
discussion in X §1), but not all the latter groups and semi-groups can be
induced in this way.

In the following we shall use a $(t, \omega)$ “two-dimensional” measure, the
usual direct product of Lebesgue $t$ measure and the given $\omega$ probability
measure.

Let $\{T_t, -\infty < t < \infty\}$ be a translation group of measure-preserving
1-1 point transformations and let $\Lambda$ be a measurable $\omega$ set. Consider
the $(t, \omega)$ set of all pairs $(t, \omega)$ with $\omega \in T_t\Lambda$, that is, the set of all pairs
$(t, \omega)$ with $T_{-t}\omega \in \Lambda$. If this $(t, \omega)$ set is $(t, \omega)$ measurable for each $\Lambda$,
the translation group is said to be measurable. The corresponding
definition of measurability is made in the semi-group case. If $x$ is a
random variable,

\[ (T_t x)(\omega) = x(T_{-t}\omega) \]

is measurable in the pair $(t, \omega)$ if the translation group or semi-group, as
the case may be, is measurable. The group $\{T_t, -\infty < t < \infty\}$ is
measurable if and only if the inverse group $\{T_{-t}, -\infty < t < \infty\}$ is
measurable.

As an example of measurability of a transformation group suppose
that $\Omega$ is the linear interval (0, 1], and let $\omega$ measure be Lebesgue measure.
Let $T_t$ be the translation (mod 1) through the distance $t$. The point $\omega$
is uniquely determined by the value of $e^{2\pi i\omega}$. To prove that this
translation group is measurable we must show that, if $A$ is a Lebesgue measurable
set on the perimeter $C$ of the unit circle in the complex plane, the $(t, \omega)$
set defined by

\[ e^{2\pi i(\omega - t)} \in A \]

is Lebesgue measurable. If $A$ is a Borel set on $C$, this is obvious, since
the exponential function is continuous. If $A$ is a Lebesgue measurable
set on $C$, it can be expressed as the union of a Borel set $A_1$ and a set $A_2$
of Lebesgue measure 0. It is thus sufficient to show that the $(t, \omega)$ set
defined by

\[ e^{2\pi i(\omega - t)} \in A_2 \]

has two-dimensional Lebesgue measure 0. Since there is a Borel set $A_3$
on $C$ of one-dimensional Lebesgue measure 0, with $A_2 \subset A_3$, it is sufficient
to show that the $(t, \omega)$ set defined by

\[ e^{2\pi i(\omega - t)} \in A_3 \]

has two-dimensional Lebesgue measure 0. This $(t, \omega)$ set has already
been seen to be measurable. For fixed $t$ the $\omega$ cross section of this set
is a rotation of $A_3$, and so has Lebesgue measure 0. Hence (Fubini's
theorem) the $(t, \omega)$ set has measure 0, as was to be proved.
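
The translation group of this example is easy to experiment with. The sketch below is only an illustration with hypothetical helper names: it represents points of the interval (0, 1] as floating point numbers, implements the translation through distance t modulo 1, and measures distances along the circle so that points just on either side of the wrap-around are treated as close.

```python
def translate(w, t):
    """Translation (mod 1) of a point w of (0, 1] through the distance t."""
    v = (w + t) % 1.0
    # The residue class of 0 is represented by the endpoint 1 of (0, 1].
    return v if v > 0.0 else 1.0


def circle_distance(u, v):
    """Distance between two points of (0, 1] measured along the circle."""
    d = abs(u - v) % 1.0
    return min(d, 1.0 - d)


# Largest deviations from the group property and from invertibility over a small grid.
points = [0.1, 0.37, 0.5, 0.99, 1.0]
shifts = [-2.3, -0.4, 0.0, 0.25, 1.7]
group_defect = max(
    circle_distance(translate(translate(w, t), s), translate(w, s + t))
    for w in points for s in shifts for t in shifts
)
inverse_defect = max(
    circle_distance(translate(translate(w, t), -t), w)
    for w in points for t in shifts
)
```

Both defects are zero up to rounding, reflecting the relation T_{s+t} = T_s T_t and the fact that the translation through -t inverts the translation through t.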

Let $\{T_t, -\infty < t < \infty\}$ be a translation group of measure-preserving
set transformations, and let $\Lambda$ be a measurable $\omega$ set. Choose some
image $T_t\Lambda$ of $\Lambda$ under $T_t$ for each $t$, and consider the $(t, \omega)$ set of all
pairs $(t, \omega)$ with $\omega \in T_t\Lambda$. If for each $\Lambda$ it is possible to choose the
images $\{T_t\Lambda, -\infty < t < \infty\}$ in such a way that the indicated $(t, \omega)$ set
is $(t, \omega)$ measurable, the translation group is said to be measurable. The
corresponding definition of measurability is made in the semi-group case.
If a translation group [semi-group] of measure-preserving point
transformations is measurable, the induced group [semi-group] of
measure-preserving set transformations is measurable. A group $\{T_t, -\infty < t < \infty\}$
of measure-preserving set transformations is measurable if and only if the
inverse group $\{T_{-t}, -\infty < t < \infty\}$ is measurable. If $x$ is a random
variable, $(T_t x)(\omega)$ is measurable in the pair $(t, \omega)$ if the group or
semi-group, as the case may be, is measurable, and if for each $t$ the image
$T_t x$ is properly chosen.

We have seen that a translation group or semi-group of
measure-preserving point or set transformations in conjunction with a random
variable defines a strictly stationary stochastic process. Conversely, just
as in the discrete parameter case of X §1, if $\{x_t, -\infty < t < \infty\}$ or
$\{x_t, 0 \le t < \infty\}$ is a strictly stationary stochastic process, there is a
corresponding translation group or semi-group of measure-preserving set
transformations which together with the random variable $x_0$ induces the
stochastic process. Following the reasoning used in the discrete
parameter case, if the $\omega$ space of the $x_t$ process is function space, and if $x_t$ is
the $t$th coordinate variable, the transformations of the group or semi-group
become point transformations, and by going to function space it is
always possible to avoid the use of set (rather than point) transformations,
and of semi-groups (rather than groups).

If $\{T_t, -\infty < t < \infty\}$ [$\{T_t, 0 \le t < \infty\}$] is a measurable translation
group [semi-group] of measure-preserving 1-1 point transformations, and
if $x$ is a random variable, the $x_t$ process defined by setting

\[ x_t(\omega) = [T_t x](\omega) = x(T_{-t}\omega) \]

is measurable, since, as we have just remarked, $x_t(\omega)$ is $(t, \omega)$ measurable.
However, if the group [semi-group] is a group [semi-group] of set
transformations, the $x_t$ process defined by setting $x_t = T_t x$ may or may not be
measurable, depending on the choice of $T_t x$, which is not uniquely
determined. In the language of II §2 an $x_t$ process obtained in this way may
not be measurable, but it will have a measurable standard modification,
that is, there is a measurable $\tilde{x}_t$ process with

\[ P\{\tilde{x}_t(\omega) = x_t(\omega)\} = 1 \]

for all $t$.

A measurable $\omega$ set is called invariant under a translation group or
semi-group of measure-preserving point or set transformations if the set
differs from its image under $T_t$ by at most a set which may depend on $t$
but which has probability 0 for each $t$. The invariant sets form a Borel
field of $\omega$ sets. A random variable $x$ is called invariant under such a
group or semi-group if, for each $t$, $x = T_t x$ with probability 1. In the
group case, the same measurable $\omega$ sets and random variables are invariant
with respect to the inverse group $\{T_{-t}, -\infty < t < \infty\}$ as with respect to
the given group. A translation group or semi-group of measure-preserving
point or set transformations is called metrically transitive if the
only invariant sets are those which have probability 0 or 1 (these sets are
always invariant), that is, if the only invariant random variables are those
which are identically constant with probability 1 (these random variables
are always invariant).

We have seen that to each strictly stationary process there corresponds
a unique translation group or semi-group of measure-preserving set
transformations. The measure-preserving set transformations are defined
on the Borel field of sets determined by conditions on the random variables
of the process. The transformations are called shifts, and the group or
semi-group is called the shift group or semi-group. Sets and random
variables invariant relative to the shift group or semi-group are called
invariant sets and random variables of the process, and the process is
called metrically transitive if the shift group or semi-group is metrically
transitive. A process is metrically transitive if and only if the
corresponding coordinate space process, for which the shift transformations
become point transformations, is metrically transitive. (This assumes
that the measure in the coordinate space is the Kolmogorov measure
determined by that of the finite dimensional sets, without further
extensions.) If $\{T_t, -\infty < t < \infty\}$ or $\{T_t, 0 \le t < \infty\}$ is a metrically
transitive translation group or semi-group and if $x$ is a random variable, the
$x_t$ process determined by setting $x_t = T_t x$ is metrically transitive. If
$\{x_t, -\infty < t < \infty\}$ is a strictly stationary stochastic process, the processes
$\{x_{-t}, -\infty < t < \infty\}$ and $\{x_t, 0 \le t < \infty\}$ are also strictly stationary.
These three processes have the same invariant sets and invariant random
variables, so they are all metrically transitive if any one is. (See the
discrete parameter argument in X §1.) If the first or second of the three
processes is measurable, the other two are; but, if only the third is known
to be measurable, all one can say is that the other two at least have
measurable standard modifications.

Let $\{x_t, -\infty < t < \infty\}$ be a stochastic process, and let $\mathscr{F}_\Delta$ be the Borel
field of $\omega$ sets generated by those of the form

\[ \{x_t(\omega) - x_s(\omega) \in G\} \]

where $G$ is open, that is, $\mathscr{F}_\Delta$ is the smallest Borel field with respect to
which the $x_t$ process differences are measurable. Then the sets of $\mathscr{F}_\Delta$ are
called the difference sets and $\mathscr{F}_\Delta$ is called the difference field. It is
sometimes useful to call the $x_t$ process strictly stationary with respect to the
difference field if, whenever $t_1 < \cdots < t_n$, the multivariate distribution
of the random variables

\[ x_{t_2} - x_{t_1},\; \cdots,\; x_{t_n} - x_{t_{n-1}} \]

is independent of translations of the $t$ axis. In particular, every process
which is strictly stationary is also strictly stationary with respect to the
difference field. The Brownian motion process is an example of a process
which is strictly stationary with respect to the difference field but not
strictly stationary. Repeating an argument we have used already, it is
easily seen that if an $x_t$ process is strictly stationary with respect to the
difference field there is a unique measure-preserving set transformation
(shift) $T_h$ for which

\[ T_h(x_t - x_s) = x_{t+h} - x_{s+h}. \]

The shifts, defined on the sets of $\mathscr{F}_\Delta$, form a translation group or
semi-group depending on the parameter set of the given process. The concepts
of invariant sets and random variables, and of metric transitivity (all
relative to the difference field) are now defined as usual in terms of the
shift transformations.

Example 1 Markov chains The continuous parameter discussion of
this example is parallel to that given in the discrete parameter case, and
will be omitted. The obvious continuous parameter version of X,
Theorem 1.1 is true, and its proof requires only obvious changes.

Example 2 Processes with independent increments This example
corresponds to the discrete parameter example of processes with mutually
independent random variables. Suppose that $\{x_t, 0 \le t < \infty\}$ is a
process with independent increments. It is then strictly stationary
relative to the difference field if and only if the increments are strictly
stationary. According to the following theorem such a process is always
metrically transitive.

THEOREM 1.1 If $\{x_t, 0 \le t < \infty\}$ is a process with strictly stationary
independent increments, it is metrically transitive relative to the difference
field.

The proof follows that of the discrete parameter case (see X, Theorem
1.2) and will be omitted. It is made to depend on a continuous parameter
version of the zero-one law, a version formulated in terms of processes
with independent increments.

Example 3 Moving averages Let the process $\{y_t, -\infty < t < \infty\}$ be
a process with (strictly) stationary independent increments for which

\[ E\{|y_1 - y_0|^2\} < \infty, \qquad E\{y_1 - y_0\} = 0. \]

Consider the $x_t$ process defined by

\[ x_t = \int_{-\infty}^{\infty} f(s)\, dy(s + t) = \int_{-\infty}^{\infty} f(s - t)\, dy(s). \]

The $y_t$ process is metrically transitive relative to the difference field, by
Theorem 1.1. Let $\{T_t, -\infty < t < \infty\}$ be the corresponding shift group.
The $x_t$ process is strictly stationary, and metrically transitive, because it
is generated by the metrically transitive group $\{T_t, -\infty < t < \infty\}$
applied to $x_0$: $x_t = T_t x_0$.

(b) Wide sense stationary processes We use in the following the
concepts of X §1 (b). A family of transformations $\{U_t, -\infty < t < \infty\}$
[$\{U_t, 0 \le t < \infty\}$] operating on the random variables of a closed linear
manifold of random variables will be called a translation group
[semi-group] of isometric transformations if the transformations are isometric
and if

\[ U_{s+t} = U_s U_t \]

for all $s$, $t$ (modulo the random variables which vanish almost everywhere).
The transformation $U_0$ will necessarily be the identity. In the group case
$U_{-t}$ will be the inverse of $U_t$, the isometric transformations will be unitary,
and $\{U_{-t}, -\infty < t < \infty\}$ will also be a translation group of unitary
transformations, called the inverse group.

If $\{U_t, -\infty < t < \infty\}$ or $\{U_t, 0 \le t < \infty\}$ is a translation group or
semi-group of isometric transformations, and if $x$ is a random variable in
the domain of definition of the transformations, the $x_t$ process defined by
$x_t = U_t x$ is stationary in the wide sense. Conversely, if $\{x_t, -\infty < t < \infty\}$
or $\{x_t, 0 \le t < \infty\}$ is a process which is stationary in the wide sense,


§1 GENERALITIES 513 


there is a corresponding translation group or semi-group of isometric
transformations such that, for each $t$, $x_t = U_t x_0$ with probability 1. The
isometries are defined on the closed linear manifold generated by the $x_t$'s.

If $\{U_t, -\infty < t < \infty\}$ or $\{U_t, 0 \le t < \infty\}$ is a translation group or
semi-group of isometric transformations, and if $x$ is a random variable
such that $U_t x = x$ with probability 1 for each $t$, the random variable $x$
is called invariant. The transformation group or semi-group is called
metrically transitive in the wide sense if the only invariant random variables 
are those which vanish with probability 1. If the isometric transforma- 
tions are shift isometries derived from a process which is stationary in 
the wide sense, the invariant random variables are said to be invariant 
random variables of the process, and the process is said to be metrically 
transitive in the wide sense if the shift group or semi-group of isometries 
is metrically transitive in the wide sense. 

If the stochastic process $\{x_t, -\infty < t < \infty\}$ is stationary in the wide
sense, the processes $\{x_{-t}, -\infty < t < \infty\}$ and $\{x_t, 0 \le t < \infty\}$ are also
stationary in the wide sense. These three processes have the same wide
sense invariant random variables, and it follows that either all three
processes or none of the three are metrically transitive in the wide sense.

If $\{x_t, -\infty < t < \infty\}$ is a stochastic process, with

$$E\{|x_t - x_s|^2\} < \infty,$$

let $\mathfrak{M}_\Delta$ be the closed linear manifold generated by the differences $x_t - x_s$.
The manifold $\mathfrak{M}_\Delta$ will be called the difference manifold of the process,
and the random variables in it the difference random variables of the
process. If the process is stationary in the wide sense, relative to the
difference random variables, that is, if the process has stationary increments
in the wide sense, there is a translation group $\{U_t, -\infty < t < \infty\}$
of unitary transformations defined on $\mathfrak{M}_\Delta$ such that

$$U_t(x_s - x_r) = x_{s+t} - x_{r+t}$$

with probability 1. The concepts of invariant random variables and
metric transitivity in the wide sense relative to the difference manifold are
then referred back to the $U_t$ group. The corresponding remarks for a
process $\{x_t, 0 \le t < \infty\}$ lead to a $U_t$ semi-group.

The following is the wide sense version of Example 2. 

Example 4 Processes with orthogonal increments A process
$\{x_t, 0 \le t < \infty\}$ with orthogonal increments is stationary in the wide sense
relative to the difference manifold if and only if its increments are
stationary in the wide sense, and we have seen in IV that this is true if
and only if

$$E\{|x_t - x_s|^2\} = \text{const.}\;|t - s|.$$




According to the following theorem such a process is always metrically 
transitive in the wide sense relative to the difference manifold. 

THEOREM 1.2 If $\{x(t), 0 \le t < \infty\}$ is a process with stationary (wide
sense) orthogonal increments, it is metrically transitive (wide sense) relative
to the difference manifold.

Suppose that

$$E\{|dx(t)|^2\} = a\,dt.$$

If $a = 0$, then $x(t) = x(s)$ with probability 1, for each pair $s$, $t$, and the
difference manifold $\mathfrak{M}_\Delta$ therefore only contains random variables which
vanish with probability 1. In this case, then, the theorem is trivially true.
If $a > 0$, consider the class $\mathfrak{M}$ of random variables of the form

$$x = \int_0^\infty c(t)\,dx(t),$$

where $c(\cdot)$ is Lebesgue measurable and

$$\int_0^\infty |c(t)|^2\,dt < \infty.$$

Since

$$x(t_2) - x(t_1) = \int_{t_1}^{t_2} dx(t),$$

$\mathfrak{M}$ includes every difference $x(t_2) - x(t_1)$. Since

$$E\left\{\left|\int_0^\infty c_1(t)\,dx(t) - \int_0^\infty c_2(t)\,dx(t)\right|^2\right\} = a\int_0^\infty |c_1(t) - c_2(t)|^2\,dt,$$

root mean square distance between integrands $c(\cdot)$ is equal, apart from a
non-zero constant factor, to root mean square distance between the
random variables $x$. Since the class of integrands is the closed linear
manifold generated by the integrands which are 1 on finite intervals and
vanish otherwise, the class of integrals is the closed linear manifold
generated by $x(t)$ differences, that is, $\mathfrak{M} = \mathfrak{M}_\Delta$. The relation between
integrand $c(t)$ and $x \in \mathfrak{M}$ determined by the integral transformation is
one to one, modulo integrands which vanish almost everywhere and
random variables which vanish with probability 1. In particular, if $x$ is
an invariant random variable in the difference manifold,

$$x = \int_0^\infty c(t)\,dx(t) = \int_0^\infty c(t)\,dx(t+s) = \int_s^\infty c(t-s)\,dx(t)$$

with probability 1, for each $s > 0$. Then, comparing integrands, $c(t) = 0$
for almost all $t$ in the interval $(0, s)$. It follows that $c(t) = 0$ for almost


§2 THE STRONG LAW OF LARGE NUMBERS 515 


all $t$, so that $x = 0$ with probability 1, and the $x(t)$ process is metrically
transitive (wide sense) with respect to the difference manifold, as was to
be proved.


2. The strong law of large numbers for strictly stationary stochastic processes

The ergodic theorem in the continuous parameter case, whose probability
name heads this section, is usually stated as follows. Let
$\{S_t, -\infty < t < \infty\}$ be a measurable translation group of measure-preserving
1-1 point transformations, and let $x$ be a measurable and integrable
function on the space involved. Then

$$\lim_{t\to\infty}\frac{1}{t}\int_0^t x(S_s\omega)\,ds$$

exists and is finite for almost all $\omega$. Since the inverse group
$\{S_{-t}, -\infty < t < \infty\}$ is a group of the same type, $S_{-s}$ can be substituted
for $S_s$ in the above integrand. In this form the theorem can be generalized
slightly by replacing the group of point transformations by a semi-group
of set transformations, which gives the following version.
Let $\{T_t, 0 \le t < \infty\}$ be a measurable translation semi-group of
measure-preserving set transformations, and let $x$ be a measurable and
integrable function on the space involved. Then if for each $t$ the random
variable $T_t x$ is chosen to make $(T_t x)(\omega)$ measurable in the pair $(t, \omega)$, it
follows that

$$\lim_{t\to\infty}\frac{1}{t}\int_0^t (T_s x)(\omega)\,ds$$

exists and is finite for almost all $\omega$. The $T_t$ semi-group corresponds to the
inverse group of the $S_t$ group in the previous version. According to §1
the following statement of the theorem differs only verbally from the one
just given as far as the existence of the limit is concerned. In this form
the theorem is also called the strong law of large numbers for strictly
stationary stochastic processes (continuous parameter case).

THEOREM 2.1 Let $\{x_t, 0 \le t < \infty\}$ be a measurable strictly stationary
stochastic process, with $E\{|x_0|\} < \infty$, and let $\mathcal{I}$ be the Borel field of
invariant $\omega$ sets. Then

$$\lim_{t\to\infty}\frac{1}{t}\int_0^t x_s(\omega)\,ds = E\{x_0 \mid \mathcal{I}\} \tag{2.1}$$

with probability 1. In particular, if the process is metrically transitive, the
right-hand side of (2.1) can be replaced by $E\{x_0\}$.
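The role of the conditional expectation in (2.1) can be seen in a small numerical sketch. The process used here is our own illustrative assumption, not one taken from the text: $x_s = \xi + \cos(2\pi s + \Theta)$, with $\Theta$ uniform on $[0, 2\pi)$ and independent of $\xi$. This process is strictly stationary but not metrically transitive, and its time averages settle on the random level $\xi$ rather than on $E\{x_0\}$. The function names and the midpoint integration rule are likewise illustrative choices.

```python
import math
import random


def sample_path(xi: float, theta: float):
    """Return the sample function s -> xi + cos(2*pi*s + theta)."""
    return lambda s: xi + math.cos(2.0 * math.pi * s + theta)


def time_average(path, t: float, steps_per_unit: int = 100) -> float:
    """Midpoint-rule approximation of (1/t) * integral_0^t path(s) ds."""
    n = max(1, int(t * steps_per_unit))
    h = t / n
    return sum(path((k + 0.5) * h) for k in range(n)) * h / t


def demo(n_paths: int = 3, t: float = 500.0, seed: int = 1):
    """Pair each invariant level xi with the time average of its sample path."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_paths):
        xi = rng.gauss(0.0, 1.0)                  # invariant random level (assumed)
        theta = rng.uniform(0.0, 2.0 * math.pi)   # uniform random phase (assumed)
        results.append((xi, time_average(sample_path(xi, theta), t)))
    return results
```

Each pair returned by `demo()` should agree to within about $1/(\pi t)$, which is the size of the oscillating remainder in the exact integral.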




It is clear that the content of the theorem is not changed if the average
on the left in (2.1) is replaced by

$$\frac{1}{t}\int_u^{u+t} x_s(\omega)\,ds$$

(fixed $u$) and that the limit is unaltered. If the parameter range of the
process is the whole $t$ line, we can replace $x_s$ by $x_{-s}$, and find that the
corresponding limiting average exists. Since the invariant sets of the
inverse shift group are the same as those of the shift group, the limiting
average is the same as that in (2.1). It then follows that also

$$\lim_{t\to\infty}\frac{1}{2t}\int_{-t}^{t} x_s(\omega)\,ds = E\{x_0 \mid \mathcal{I}\}$$

with probability 1. However, the limit

$$\lim_{t_2 - t_1\to\infty}\frac{1}{t_2 - t_1}\int_{t_1}^{t_2} x_s(\omega)\,ds$$

does not exist with probability 1, in general.
Since it is supposed in the theorem that the process is measurable, the
sample functions are almost all measurable. Moreover, since

$$\int_a^b E\{|x_s(\omega)|\}\,ds = (b - a)E\{|x_0|\} < \infty,$$

it follows from II, Theorem 2.7, that the sample functions are almost all
Lebesgue integrable over finite intervals. Thus the integral averages of
the theorem are defined with probability 1. The existence of the limiting
averages can be proved by giving the continuous parameter version of the
discrete parameter proof (X, Theorem 2.1) or can be reduced to the
conclusion of that theorem as follows. Define $\hat{x}_m$, $\hat{y}_m$ by

$$\hat{x}_m(\omega) = \int_m^{m+1} x_s(\omega)\,ds, \qquad \hat{y}_m(\omega) = \int_m^{m+1} |x_s(\omega)|\,ds.$$

The $\hat{x}_m$ and $\hat{y}_m$ processes are strictly stationary integral parameter
processes. Hence by X, Theorem 2.1,

$$\lim_{n\to\infty}\frac{1}{n}\sum_{m=0}^{n-1}\hat{x}_m(\omega) = \lim_{n\to\infty}\frac{1}{n}\int_0^n x_s(\omega)\,ds \tag{2.2}$$




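exists and is finite with probability 1, and the same is true for the $\hat{y}_m$
averages. The latter fact implies that

$$\lim_{n\to\infty}\frac{\hat{y}_n(\omega)}{n} = \lim_{n\to\infty}\frac{1}{n}\int_n^{n+1} |x_s(\omega)|\,ds = 0 \tag{2.3}$$

with probability 1. If $[t]$ is the largest integer $\le t$,

$$\frac{1}{t}\int_0^t x_s(\omega)\,ds = \frac{[t]}{t}\left(\frac{1}{[t]}\int_0^{[t]} x_s(\omega)\,ds\right) + \varepsilon_t,$$

where, using (2.3),

$$|\varepsilon_t| = \left|\frac{1}{t}\int_{[t]}^{t} x_s(\omega)\,ds\right| \le \frac{1}{[t]}\int_{[t]}^{[t]+1} |x_s(\omega)|\,ds \to 0 \quad (t\to\infty) \tag{2.4}$$

with probability 1. Hence the existence of the limit in (2.2) with
probability 1 combined with (2.4) implies the existence of the limit in (2.1)
with probability 1. The rest of the theorem is proved in exactly the same
way as the analogous part of X, Theorem 2.1. As in the discrete
parameter case, if $\delta \ge 1$ and if $E\{|x_0|^\delta\} < \infty$,

$$\lim_{t\to\infty} E\left\{\left|\frac{1}{t}\int_0^t x_s(\omega)\,ds - E\{x_0 \mid \mathcal{I}\}\right|^\delta\right\} = 0;$$

that is, there is convergence to the limit in the mean of order $\delta$. The
corollary to X, Theorem 2.1, becomes here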
COROLLARY If $\mu$ is real,

$$\lim_{t\to\infty}\frac{1}{t}\int_0^t x_s(\omega)e^{-2\pi i\mu s}\,ds = \hat{x}(\mu)$$

exists with probability 1. The random variable $\hat{x}(\mu)$ has finite expectation,

$$E\{\hat{x}(0)\} = E\{x_0\}, \qquad E\{\hat{x}(\mu)\} = 0, \quad \mu \ne 0,$$

and is transformed by the shift transformation with parameter value $t$ into
$e^{2\pi i\mu t}\hat{x}(\mu)$. Unless $\mu$ is in an (at most denumerable) exceptional set,

$$P\{\hat{x}(\mu) = 0\} = 1;$$

if $E\{|x_0|^2\} < \infty$, then $E\{|\hat{x}(\mu)|^2\} < \infty$ also,

$$\operatorname*{l.i.m.}_{t\to\infty}\frac{1}{t}\int_0^t x_s(\omega)e^{-2\pi i\mu s}\,ds = \hat{x}(\mu), \tag{2.5}$$

and

$$E\{\hat{x}(\mu_1)\overline{\hat{x}(\mu_2)}\} = 0, \qquad \mu_1 \ne \mu_2.$$

The proof follows the discrete parameter proof and will be omitted.
As in the theorem, the average $\frac{1}{t}\int_0^t$ can be replaced by the average
$\frac{1}{t}\int_u^{u+t}$, and so on.


3. The covariance function of a stationary process; examples 


In this and the following sections we shall consider stationary (wide
sense) complex processes $\{x(t), t \in T\}$ as defined in II §8, with particular
reference to their harmonic analysis. Since most of the theorems are
exact analogues of those in the discrete parameter case, the details will be
omitted. One new hypothesis will be introduced: it will be assumed that

$$\lim_{t-s\to 0} E\{|x(t) - x(s)|^2\} = 0. \tag{3.1}$$

According to II §2 this continuity hypothesis implies that some standard
modification of the $x(t)$ process is separable and measurable. We recall
that the transition from the original process to a standard modification
does not affect the joint distributions of finite aggregates of $x(t)$'s.
Conversely it can be shown that, if a standard modification of the $x(t)$ process
is measurable, then (3.1) is satisfied. Thus (3.1) is a minimal continuity
hypothesis, and we shall assume that it is satisfied whenever we discuss
processes stationary in the wide sense. The parameter set of such a
process will always be taken to be either $(-\infty, \infty)$ or $[0, \infty)$.

If a process is measurable, we have seen in II §2 that almost all its
sample functions are Lebesgue measurable. Moreover, if the process is
stationary in the wide sense as well as measurable, the squares of its
sample functions are almost all Lebesgue integrable over every finite
interval, using II, Theorem 2.7, because then

$$\int_a^b E\{|x(t)|^2\}\,dt = (b - a)E\{|x(0)|^2\} < \infty.$$

The covariance function of a process which is stationary in the wide
sense is defined by

$$R(t) = E\{x(s+t)\overline{x(s)}\}.$$


The following theorem describes the class of covariance functions. 


$3 THE COVARIANCE FUNCTION 519 


THEOREM 3.1 If (3.1) is satisfied, the covariance function $R$ is positive
definite, that is, it is continuous, and

$$R(-t) = \overline{R(t)},$$

$$\sum_{m,n=1}^{N} R(t_m - t_n)\,\alpha_m\overline{\alpha_n} \ge 0$$

for every finite set of parameter values $t_1, \ldots, t_N$ and complex numbers
$\alpha_1, \ldots, \alpha_N$. Conversely, any function $R$ satisfying these conditions, and
continuous when $t = 0$, is everywhere continuous, and is the covariance
function of a stationary (wide sense) process. If the covariance function is
real, the latter process can be taken as real.

The continuity of the covariance function follows from the inequality

$$|R(t) - R(s)| = \left|E\{[x(t) - x(s)]\overline{x(0)}\}\right| \le \left[E\{|x(t) - x(s)|^2\}\,E\{|x(0)|^2\}\right]^{1/2},$$

in which the right side approaches 0 when $t - s \to 0$, by (3.1). The rest
of the proof follows that of X, Theorem 3.1, and will be omitted.
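The quadratic form condition of Theorem 3.1 is easy to probe numerically. The sketch below is only an illustration under assumptions of our own: it takes the test covariance $R(t) = e^{-|t|}$, draws random parameter values $t_1, \ldots, t_N$ and random complex coefficients $\alpha_1, \ldots, \alpha_N$, and reports the smallest value of the form found. A spot check of this kind cannot prove positive definiteness; it can only fail to find a counterexample.

```python
import math
import random


def R(t: float) -> float:
    """Assumed test covariance R(t) = exp(-|t|): real, even and continuous."""
    return math.exp(-abs(t))


def quadratic_form(times, alphas) -> complex:
    """Return sum over m, n of R(t_m - t_n) * alpha_m * conj(alpha_n)."""
    total = 0j
    for tm, am in zip(times, alphas):
        for tn, an in zip(times, alphas):
            total += R(tm - tn) * am * an.conjugate()
    return total


def min_quadratic_form(trials: int = 200, size: int = 6, seed: int = 0) -> float:
    """Smallest real part of the form over randomly drawn times and coefficients."""
    rng = random.Random(seed)
    smallest = float("inf")
    for _ in range(trials):
        times = [rng.uniform(-5.0, 5.0) for _ in range(size)]
        alphas = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(size)]
        q = quadratic_form(times, alphas)
        # R is real and even, so the form is Hermitian: imaginary part is rounding only.
        assert abs(q.imag) < 1e-9
        smallest = min(smallest, q.real)
    return smallest
```

Replacing `R` by a function that is not positive definite (for example the indicator of $|t| \le 1$) would be expected to produce negative values for some draws.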
The continuous parameter version of X, Theorem 3.2, is

THEOREM 3.2 A function $R$ is positive definite if and only if it can be
expressed in the form

$$R(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dF(\lambda), \tag{3.2}$$

where $F$ is monotone non-decreasing and bounded. The function $F$ is
uniquely determined, if suitably normalized, by the equation

$$\frac{F(\lambda_2+) + F(\lambda_2-)}{2} - \frac{F(\lambda_1+) + F(\lambda_1-)}{2} = \lim_{T\to\infty}\int_{-T}^{T}\frac{e^{-2\pi i\lambda_2 t} - e^{-2\pi i\lambda_1 t}}{-2\pi i t}\,R(t)\,dt. \tag{3.3}$$

If $R$ is real, (3.2) can be replaced by

$$R(t) = \int_0^{\infty}\cos 2\pi t\lambda\,dG(\lambda), \tag{3.2'}$$

where $G$ is monotone non-decreasing and bounded, and is uniquely
determined, if suitably normalized, by

$$\frac{G(\lambda+) + G(\lambda-)}{2} - G(0) = \lim_{T\to\infty}\frac{2}{\pi}\int_0^T\frac{\sin 2\pi\lambda t}{t}\,R(t)\,dt. \tag{3.3'}$$




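The proof is simply a transcription of that of X, Theorem 3.2, to the
continuous parameter case, and will therefore only be sketched. It is
immediately verified that (3.2) defines a positive definite function whenever
$F$ is monotone non-decreasing and bounded. The formula (3.3) is simply
Lévy's formula for a distribution function in terms of its characteristic
function. From the present point of view the natural proof of Lévy's
formula is the following. Suppose $\lambda_1 < \lambda_2$ and define $\Phi$ by

$$\Phi(\lambda) = \begin{cases} 0, & \lambda < \lambda_1 \text{ or } \lambda > \lambda_2, \\ \tfrac{1}{2}, & \lambda = \lambda_1 \text{ or } \lambda = \lambda_2, \\ 1, & \lambda_1 < \lambda < \lambda_2. \end{cases}$$

Then, if $\Phi^*$ is the Fourier transform of $\Phi$,

$$\Phi^*(t) = \int_{-\infty}^{\infty}\Phi(\lambda)e^{-2\pi i t\lambda}\,d\lambda = \int_{\lambda_1}^{\lambda_2} e^{-2\pi i t\lambda}\,d\lambda,$$

the inverse Fourier transform is $\Phi$, in the sense that

$$\Phi(\lambda) = \lim_{T\to\infty}\int_{-T}^{T}\Phi^*(t)e^{2\pi i t\lambda}\,dt$$

and the convergence is bounded convergence. Using this fact, we find
that

$$\lim_{T\to\infty}\int_{-T}^{T}\Phi^*(t)R(t)\,dt = \lim_{T\to\infty}\int_{-T}^{T}\Phi^*(t)\,dt\int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dF(\lambda) = \lim_{T\to\infty}\int_{-\infty}^{\infty} dF(\lambda)\int_{-T}^{T}\Phi^*(t)e^{2\pi i t\lambda}\,dt = \int_{-\infty}^{\infty}\Phi(\lambda)\,dF(\lambda),$$

and this is precisely (3.3).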

Conversely, suppose that R is a positive definite function of t. Then 
the discrete parameter derivation of R as a Fourier-Stieltjes transform can 
be carried through in the continuous parameter case, when properly 
modified, or the latter case is reduced to the discrete parameter case as 
follows. For every e > 0, R(ne) defines a positive definite function of 
the integer n. Hence (X, Theorem 3.2) 

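$$R(n\varepsilon) = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\,dF_\varepsilon(\lambda/\varepsilon) = \int_{-1/2\varepsilon}^{1/2\varepsilon} e^{2\pi i n\varepsilon\lambda}\,dF_\varepsilon(\lambda),$$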




where $F_\varepsilon$ is monotone non-decreasing, with total increase $R(0)$. In the
following, $\varepsilon$ will go to 0 along the values $1/2, 1/2^2, 1/2^3, \ldots$. By Helly's
theorem, if we define $F_\varepsilon(\lambda) = 0$ for $\lambda < -1/2\varepsilon$ and $F_\varepsilon(\lambda) = R(0)$ for
$\lambda > 1/2\varepsilon$, when $\varepsilon \to 0$ there is a subsequence of $\varepsilon$'s along which
$\lim_{\varepsilon\to 0} F_\varepsilon(\lambda) = F(\lambda)$ exists for all $\lambda$. If integration to the limit can be
justified, the evaluation of $R(n\varepsilon)$ then becomes (3.2), valid if $t$ has the
form $n/2^m$ for some integral $n$ and positive integral $m$; it will then be
valid for all $t$ by continuity. To justify integration to the limit it need
only be shown that

$$\lim_{A\to\infty}\int_{-A}^{A} dF_\varepsilon(\lambda) = R(0) \tag{3.4}$$

uniformly in $\varepsilon$, when $\varepsilon \to 0$ along some sequence. Now, if $\varepsilon = 2^{-m}$, if
T= 2""¢ = andi rim, 


omer aie | — erita 
piza 2 Raa = = a =r) f O] 


—=1/2e 


< fuos i a ay FO. 


a> A 
When m —> co this becomes 
1 r 
Fal R(t) dt} < lim wf dF,(A) +o iim Fas a dF(A) 
p 


0 


A 
aire R(0) 
= [im inf ar] (1-; ai) + AT 
=A 
Hence

$$\left|\frac{1}{T}\int_0^T R(t)\,dt\right| \le \liminf_{A\to\infty}\,\liminf_{m\to\infty}\int_{-A}^{A} dF_\varepsilon(\lambda).$$

The right side is at most $R(0)$; the left side becomes $R(0)$ when $T \to 0$.
It follows that the right side is $R(0)$, and this is simply another way of
stating (3.4) (with uniformity). This finishes the proof of the theorem for
$R$ complex. If $F$ is normalized by adjusting it so that

$$F(-\infty) = 0, \qquad F(\lambda+) = F(\lambda),$$




$F$ is uniquely determined by (3.3). As in the discrete parameter case, if
$R$ is real, $dF(\lambda)$ is even, and (3.3') is obtained with

$$G(\lambda) = \begin{cases} 2[F(\lambda) - F(0+)] + F(0+) - F(0-), & \lambda > 0, \\ 0, & \lambda = 0, \end{cases}$$

and $G$ is uniquely determined by (3.3'), if it is normalized by adjusting it
so that

$$G(0) = 0, \qquad G(\lambda+) = G(\lambda), \quad \lambda > 0.$$

If $F$ is absolutely continuous, $F'$ is called the spectral density function
(complex form) of the process; in the real case $F$ is absolutely continuous
if and only if $G$ is, and in that case $G' = 2F'$ is called the spectral density
function (real form) of the process. When the phrase "spectral density"
is used it is always to be understood that the spectral distribution is
absolutely continuous.

If $\int_{-\infty}^{\infty} |R(t)|\,dt < \infty$ there is a continuous spectral density function
given by

$$F'(\lambda) = \int_{-\infty}^{\infty} R(t)e^{-2\pi i t\lambda}\,dt \tag{3.5}$$

which reduces to

$$G'(\lambda) = 4\int_0^{\infty} R(t)\cos 2\pi\lambda t\,dt \tag{3.5'}$$

in the real case. In fact, this restriction on the covariance function
implies that (3.5) and (3.5') can be integrated to give (3.3) and (3.3').
The spectrum of the process, as in the discrete parameter case, consists
of the numbers $\lambda$ in the neighborhood of which $F$ is actually increasing.
These numbers are the frequencies which enter effectively in the harmonic
analysis of both the covariance function and the sample functions of the
process.
Example 1 Suppose that the random variables $\{x(t)\}$ of a process
satisfy

$$E\{x(s)\overline{x(t)}\} = \begin{cases} 0, & s \ne t, \\ \sigma^2 > 0, & s = t. \end{cases}$$

The process is then stationary in the wide sense, with

$$R(t) = \begin{cases} 0, & t \ne 0, \\ \sigma^2, & t = 0, \end{cases}$$

but we have excluded it from consideration since the continuity condition
(3.1) is not satisfied.

Example 2 Suppose that the $x(t)$ process is a Markov process in the
wide sense, and is also stationary in the wide sense. Then (V §8)

$$R(t) = e^{-ct}R(0), \quad t \ge 0, \qquad c = \alpha + i\beta, \quad \alpha \ge 0.$$

If $\alpha = 0$,

$$R(t) = e^{-i\beta t}R(0), \qquad -\infty < t < \infty,$$

and (cf. X §3, Example 2)

$$x(t) = e^{-i\beta t}x(0), \qquad -\infty < t < \infty,$$

with probability 1, for each $t$, and conversely any family of random
variables satisfying this equation obviously defines a Markov process
(wide sense) which is stationary (wide sense). If $\alpha > 0$, and assuming
for the moment that $R$ is actually a covariance function, that is, positive
definite, so that there is a spectral density which can be evaluated by (3.5),

$$F'(\lambda) = R(0)\int_0^{\infty} e^{-ct - 2\pi i\lambda t}\,dt + R(0)\int_{-\infty}^{0} e^{\bar{c}t - 2\pi i\lambda t}\,dt = \frac{2\alpha R(0)}{\alpha^2 + (2\pi\lambda + \beta)^2}.$$

Since this function of $\lambda$ is non-negative and integrable it defines a spectral
density. The corresponding covariance function defined by (3.2) is the
given one, which is thereby proved legitimate. If $\{\xi(t), -\infty < t < \infty\}$
is a process with orthogonal increments, with

$$E\{|d\xi(t)|^2\} = dt,$$

if $c$ is a constant with a positive real part, and if $x(t)$ is defined by

$$x(t) = \int_0^{\infty} e^{-cs}\,d\xi(t - s),$$

the $x(t)$ process has covariance function proportional to $e^{-ct}$ ($t > 0$). It
will be shown in §8 that every stationary (wide sense) Markov (wide
sense) process can be put in this form, aside from a multiplicative constant,
unless it has the form described above when $\alpha = 0$.
In particular, if the process is real, $c = \alpha$ and

$$F'(\lambda) = \frac{2\alpha R(0)}{\alpha^2 + 4\pi^2\lambda^2}.$$
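The closed form for the real case can be checked against a direct numerical evaluation of (3.5). The following sketch is an illustration only: it integrates $2\int_0^{t_{\max}} R(0)e^{-\alpha t}\cos 2\pi\lambda t\,dt$ by the trapezoidal rule and compares the result with $2\alpha R(0)/(\alpha^2 + 4\pi^2\lambda^2)$. The function names, the truncation point `t_max` and the step count are our own assumptions.

```python
import math


def closed_form_density(lam: float, alpha: float, r0: float = 1.0) -> float:
    """Spectral density 2*alpha*R(0) / (alpha^2 + 4*pi^2*lam^2) of the real case."""
    return 2.0 * alpha * r0 / (alpha ** 2 + 4.0 * math.pi ** 2 * lam ** 2)


def numerical_density(lam: float, alpha: float, r0: float = 1.0,
                      t_max: float = 40.0, steps: int = 100_000) -> float:
    """Trapezoidal value of 2 * integral_0^t_max R(t) cos(2*pi*lam*t) dt.

    R(t) = r0 * exp(-alpha * |t|) is even, so the two-sided integral in (3.5)
    equals twice the one-sided cosine integral; the tail beyond t_max is
    neglected (it is of order exp(-alpha * t_max)).
    """
    h = t_max / steps

    def integrand(t: float) -> float:
        return r0 * math.exp(-alpha * t) * math.cos(2.0 * math.pi * lam * t)

    total = 0.5 * (integrand(0.0) + integrand(t_max))
    for k in range(1, steps):
        total += integrand(k * h)
    return 2.0 * h * total
```

For instance, with `alpha = 1.0` the two functions should agree closely at `lam = 0.0`, `0.5` and `2.0`, up to the discretization and truncation error of the quadrature.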




Example 3 Let $\xi_1, \ldots, \xi_k$ be mutually orthogonal random variables,
with

$$E\{|\xi_j|^2\} = \sigma_j^2 > 0,$$

and let $\lambda_1, \ldots, \lambda_k$ be distinct real numbers; define $x(t)$ by

$$x(t) = \sum_{j=1}^{k}\xi_j e^{2\pi i t\lambda_j}.$$

Then

$$E\{x(s+t)\overline{x(s)}\} = \sum_{j=1}^{k}\sigma_j^2 e^{2\pi i t\lambda_j} = R(t)$$

is independent of $s$, so that the $x(t)$ process is stationary in the wide sense.
The spectrum contains only the points $\lambda_1, \ldots, \lambda_k$, at which the spectral
distribution function has the jumps $\sigma_1^2, \ldots, \sigma_k^2$ respectively.
Conversely, if a covariance function $R$ is given by this formula, it will be
proved below that $x(t)$ must have the stated form. The real process
corresponding to this complex process is defined as follows. Let
$u_1, \ldots, u_k, v_1, \ldots, v_k$ be mutually orthogonal real random variables,
with

$$E\{u_j^2\} = E\{v_j^2\} = \sigma_j^2 > 0,$$

and let $\lambda_1, \ldots, \lambda_k$ be any real numbers. Define $x(t)$ by

$$x(t) = \sum_{j=1}^{k} u_j\cos 2\pi t\lambda_j + v_j\sin 2\pi t\lambda_j.$$

Then

$$E\{x(s+t)x(s)\} = \sum_{j=1}^{k}\sigma_j^2\cos 2\pi t\lambda_j = R(t)$$

is independent of $s$, so that the $x(t)$ process is stationary in the wide sense.
We can replace any negative $\lambda_j$ by $|\lambda_j|$, changing the sign of $v_j$ to
compensate. Thus it is no restriction to assume that $\lambda_j \ge 0$ (and that the
$\lambda_j$'s are distinct). With these assumptions the spectral distribution (real
form) increases only at $\lambda_1, \ldots, \lambda_k$, having jump $\sigma_j^2$ at $\lambda_j$. The spectral
distribution function (complex form) increases only at $\pm\lambda_j$, having jump
$\sigma_j^2$ at $\lambda_j$ if $\lambda_j = 0$, $\sigma_j^2/2$ at $\lambda_j$ and $-\lambda_j$ otherwise. With a proper choice
of the $\lambda_j$'s and $\xi_j$'s the complex version of Example 3 reduces to this
real one.

In particular, if the $u_j$'s and $v_j$'s are Gaussian, with $E\{u_j\} = E\{v_j\} = 0$,
the $x(t)$ process is stationary in the strict sense.

It will be proved below that every stationary (wide sense) process is
either as in Example 3 or can be approximated arbitrarily closely by
processes of this type.
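The fact that $E\{x(s+t)x(s)\}$ in the real form of Example 3 does not depend on $s$ can be illustrated by simulation. The sketch below is a Monte Carlo check under assumptions of our own: the $u_j$, $v_j$ are taken to be independent Gaussian variables with mean 0 (orthogonality alone is all the example requires), and the frequencies, variances, sample size and seed are arbitrary illustrative values.

```python
import math
import random


def covariance_formula(t, lambdas, sigma2):
    """R(t) = sum_j sigma_j^2 * cos(2*pi*t*lambda_j)."""
    return sum(s2 * math.cos(2.0 * math.pi * t * lam)
               for lam, s2 in zip(lambdas, sigma2))


def empirical_covariance(s, t, lambdas, sigma2, n_samples=20000, seed=0):
    """Monte Carlo estimate of E{x(s+t) x(s)} for the real trigonometric sum."""
    rng = random.Random(seed)

    def x(time, u, v):
        # x(time) = sum_j u_j cos(2 pi time lambda_j) + v_j sin(2 pi time lambda_j)
        return sum(uj * math.cos(2.0 * math.pi * time * lam)
                   + vj * math.sin(2.0 * math.pi * time * lam)
                   for lam, uj, vj in zip(lambdas, u, v))

    total = 0.0
    for _ in range(n_samples):
        # Gaussian amplitudes with variance sigma_j^2 (an illustrative assumption).
        u = [rng.gauss(0.0, math.sqrt(s2)) for s2 in sigma2]
        v = [rng.gauss(0.0, math.sqrt(s2)) for s2 in sigma2]
        total += x(s + t, u, v) * x(s, u, v)
    return total / n_samples
```

Calling `empirical_covariance` with two quite different values of `s` and the same lag `t` should give estimates that both lie near `covariance_formula(t, ...)`, within sampling error.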




Example 4 Let $\xi$ be any real random variable, and let $a$ be a constant.
Define $x(t)$ by

$$x(t) = a e^{2\pi i t\xi}.$$

Then

$$E\{x(s+t)\overline{x(s)}\} = |a|^2 E\{e^{2\pi i t\xi}\} = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dF(\lambda) = R(t),$$

where $F/|a|^2$ is the distribution function of $\xi$. The $x(t)$ process is stationary
in the wide sense, with spectral distribution function $F$. This example
exhibits a stationary (wide sense) process with any preassigned spectral
distribution. Note that the sample functions are periodic. The choice
of a sample function is the choice of the frequency $\xi$. Since (cf. Example
3) the sample functions of a stationary process need not be periodic, the
spectral distribution function does not determine the harmonic analysis of
individual sample functions, except in some sort of average sense. This
statement is of course not necessarily true if only certain classes of
processes are considered. For example, if the processes considered are
real and Gaussian, with $E\{x(t)\} = 0$, the spectral distribution function
determines the covariance function and thereby the joint distribution of
every finite set of $x(t)$'s.

It is obvious from X §3, Example 4, what the corresponding real process
is, and no discussion of this process is necessary. The further remarks
made on the randomness of this process in the discrete parameter case
are also applicable to the present case.
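A quick simulation makes the relation between the covariance function of Example 4 and the distribution of the random frequency concrete. As an assumption made only for this sketch, $\xi$ is taken to be standard Gaussian, in which case $E\{e^{2\pi i t\xi}\} = e^{-2\pi^2 t^2}$; the function names, sample size and seed are likewise illustrative.

```python
import cmath
import math
import random


def exact_random_frequency_covariance(t: float, a: complex = 1.0) -> float:
    """|a|^2 * E{exp(2*pi*i*t*xi)} for standard Gaussian xi, i.e. |a|^2 exp(-2 pi^2 t^2)."""
    return abs(a) ** 2 * math.exp(-2.0 * math.pi ** 2 * t ** 2)


def simulated_random_frequency_covariance(s: float, t: float, a: complex = 1.0,
                                          n_samples: int = 20000, seed: int = 0) -> complex:
    """Monte Carlo estimate of E{x(s+t) conj(x(s))} with x(t) = a exp(2*pi*i*t*xi)."""
    rng = random.Random(seed)
    total = 0j
    for _ in range(n_samples):
        xi = rng.gauss(0.0, 1.0)  # one draw of xi selects one periodic sample function
        x_later = a * cmath.exp(2j * math.pi * (s + t) * xi)
        x_now = a * cmath.exp(2j * math.pi * s * xi)
        total += x_later * x_now.conjugate()
    return total / n_samples
```

Every simulated sample function here is a pure sinusoid, yet the averaged products reproduce a smooth, rapidly decaying covariance; this is the sense in which the spectral distribution describes the process only on the average.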

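Example 5 Suppose that $x_h(t)$ is defined by

$$x_h(t) = \frac{y(t+h) - y(t)}{h},$$

where the $y(t)$ process has orthogonal increments, and

$$E\{|dy(t)|^2\} = \sigma^2\,dt, \qquad \sigma > 0.$$

Then

$$E\{x_h(s+t)\overline{x_h(s)}\} = \begin{cases} 0, & |t| > h, \\ \dfrac{\sigma^2}{h^2}(h - |t|), & |t| \le h; \end{cases}$$

the $x_h(t)$ process is stationary (wide sense) for fixed $h$, with covariance
function $R$ given by the preceding formula, and spectral density $F'$ given by

$$F'(\lambda) = \int_{-h}^{h} e^{-2\pi i\lambda t}\,\frac{\sigma^2}{h^2}(h - |t|)\,dt = \sigma^2\,\frac{1 - \cos 2\pi\lambda h}{2\pi^2\lambda^2 h^2}.$$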




This spectral density is very nearly $\sigma^2$ if $h$ is small. In other words,
although $x_h(t)$ does not necessarily converge to a limiting random variable
$x(t)$ when $h \to 0$, its spectral distribution acts as if $x(t) = y'(t)$ existed
and defined a stationary process with spectral density $\sigma^2$. No process
with constant spectral density exists, since the spectral density has a finite
integral over the range $(-\infty, +\infty)$, but the behavior of $x_h(t)$ for small $h$
makes it plausible that whenever a $y'(t)$ appears symbolically, as in the
stochastic integrals $\int_{-\infty}^{\infty}\Phi(t)\,dy(t)$ of IX, the $y'(t)$ (although symbolic) will act,
from the point of view of harmonic analysis, like a stationary wide sense
process with constant spectral density. Examples of this will appear
below.
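The approach of the spectral density of Example 5 to the constant $\sigma^2$ can be tabulated directly. The sketch below evaluates $\sigma^2(1 - \cos 2\pi\lambda h)/(2\pi^2\lambda^2 h^2)$ in the equivalent form $\sigma^2[\sin(\pi\lambda h)/(\pi\lambda h)]^2$, which follows from $1 - \cos 2w = 2\sin^2 w$ and avoids cancellation for small $\lambda h$. The function name is an illustrative choice of ours.

```python
import math


def difference_quotient_density(lam: float, h: float, sigma2: float = 1.0) -> float:
    """Spectral density of the difference quotient process x_h at frequency lam.

    Uses sigma^2 * (sin(w)/w)^2 with w = pi*lam*h, which equals
    sigma^2 * (1 - cos(2*pi*lam*h)) / (2*pi^2*lam^2*h^2).
    """
    w = math.pi * lam * h
    if w == 0.0:
        return sigma2  # limiting value as lam*h -> 0
    return sigma2 * (math.sin(w) / w) ** 2
```

For a fixed frequency the value tends to `sigma2` as `h` decreases, while for fixed `h` the density still vanishes at the frequencies $\lambda = 1/h, 2/h, \ldots$, so the flattening is never uniform over the whole $\lambda$ axis.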

Example 6 Suppose that $x(t)$ is defined by

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy(\lambda),$$

where the $y(\lambda)$ process has orthogonal increments with

$$E\{|dy(\lambda)|^2\} = dF(\lambda), \qquad F(-\infty) = 0, \qquad F(\lambda+) = F(\lambda).$$

Then (IX §2)

$$E\{x(s+t)\overline{x(s)}\} = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dF(\lambda) = R(t)$$

is independent of $s$, so that the $x(t)$ process is stationary in the wide sense
with covariance function $R$ and spectral distribution function $F$. It will
be shown in §4 that every stationary (wide sense) process can be
represented in this way, and that in the real case the representation can also be
put in the form

$$x(t) = \int_0^{\infty}\cos 2\pi t\lambda\,du(\lambda) + \sin 2\pi t\lambda\,dv(\lambda),$$

where the $u(\lambda)$, $v(\lambda)$ processes are real, with orthogonal increments, and

$$E\{|du(\lambda)|^2\} = E\{|dv(\lambda)|^2\} = dG(\lambda), \qquad E\{du(\lambda)\,dv(\mu)\} = 0.$$

Example 3 above is the particular case in which $F$ is constant except for
a finite number of jumps, so that the spectrum of the process only contains
a finite number of points. This representation of $x(t)$ is called the spectral
representation (real or complex form as the case may be).


§4 THE SPECTRAL REPRESENTATION 527 


4. The spectral representation of a stationary process 


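THEOREM 4.1 Every stationary (wide sense) process $\{x(t), -\infty < t < \infty\}$
satisfying (3.1) has the spectral representation

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy(\lambda), \tag{4.1}$$

where the $y(\lambda)$ process has orthogonal increments, and

$$E\{|dy(\lambda)|^2\} = dF(\lambda).$$

If a $y(\lambda)$ process with orthogonal increments satisfies (4.1), with a suitable
evaluation of the right-hand side, then

$$\frac{y(\lambda_2+) + y(\lambda_2-)}{2} - \frac{y(\lambda_1+) + y(\lambda_1-)}{2} = \operatorname*{l.i.m.}_{T\to\infty}\int_{-T}^{T}\frac{e^{-2\pi i\lambda_2 t} - e^{-2\pi i\lambda_1 t}}{-2\pi i t}\,x(t)\,dt, \qquad -\infty < \lambda_1 < \lambda_2 < \infty, \tag{4.2}$$

and this equation determines $y(\lambda)$ uniquely, neglecting values on an $\omega$ set
of probability 0, if the $y(\lambda)$ process is properly normalized.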
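If the $x(t)$ process is real, (4.1) can be put in the form

$$x(t) = \int_0^{\infty}\cos 2\pi t\lambda\,du(\lambda) + \sin 2\pi t\lambda\,dv(\lambda), \tag{4.1'}$$

where the (real) $u(\lambda)$, $v(\lambda)$ processes have orthogonal increments, with

$$E\{[du(\lambda)]^2\} = E\{[dv(\lambda)]^2\} = dG(\lambda), \quad \lambda > 0,$$
$$E\{[du(\lambda)]^2\} = dG(\lambda), \quad \lambda \ge 0,$$
$$E\{du(\lambda)\,dv(\mu)\} = 0, \quad 0 \le \lambda, \mu < \infty.$$

If $u(\lambda)$, $v(\lambda)$ processes with orthogonal increments satisfy these conditions
and (4.1'), with a suitable evaluation of the right-hand side, then

$$\frac{u(\lambda+) + u(\lambda-)}{2} - u(0) = \operatorname*{l.i.m.}_{T\to\infty}\int_{-T}^{T}\frac{\sin 2\pi\lambda t}{\pi t}\,x(t)\,dt, \qquad 0 < \lambda < \infty,$$

$$\frac{v(\lambda_2+) + v(\lambda_2-)}{2} - \frac{v(\lambda_1+) + v(\lambda_1-)}{2} = \operatorname*{l.i.m.}_{T\to\infty}\int_{-T}^{T}\frac{\cos 2\pi t\lambda_1 - \cos 2\pi t\lambda_2}{\pi t}\,x(t)\,dt, \qquad 0 < \lambda_1 < \lambda_2 < \infty, \tag{4.2'}$$

and these equations determine $u(\lambda)$, $v(\lambda)$ uniquely, neglecting values on an
$\omega$ set of probability 0, if the $u(\lambda)$, $v(\lambda)$ processes are properly normalized.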
The proof of this theorem is identical with that of X, Theorem 4.1
(except that Fourier series are replaced by Fourier integrals), and will be
omitted. We remark that the normalized $v(\lambda)$ process would have

$$P\{v(0+, \omega) = v(0, \omega)\} = 1,$$

since a discontinuity at 0 would make no contribution to the stochastic
integral in (4.1'), and can therefore be subtracted out. The integrals in
(4.2') can be interpreted as Lebesgue integrals involving sample functions
of the $x(t)$ process, if this process is measurable (see §3), and otherwise
as Lebesgue integrals involving sample functions of a measurable standard
modification of the $x(t)$ process, as explained in II §2.

The general discussion of X §4 is applicable to the continuous parameter
case, and will not be repeated, except for the discussion of the significance
of Example 3. As in the discrete parameter case, if the spectrum of the
$x(t)$ process contains only a finite number of points, the spectral
representation reduces to a finite sum of the type discussed in §3, Example 3,

$$x(t) = \sum_j e^{2\pi i t\lambda_j}\xi_j = \sum_j e^{2\pi i t\lambda_j}[y(\lambda_j+) - y(\lambda_j-)],$$

where the $\xi_j$'s are mutually orthogonal. Moreover, in the general case
each Riemann-Stieltjes sum approximating $x(t)$ is of this same type,

$$\sum_j e^{2\pi i t\lambda_j}\xi_j = \sum_j e^{2\pi i t\lambda_j}[y(\lambda_j) - y(\lambda_{j-1})].$$

Thus Example 3 can be used to approximate the general case in the sense
that to every $\varepsilon > 0$ corresponds a stationary wide sense process of the
type in Example 3, with variables $\{x_\varepsilon(t)\}$, satisfying

$$E\{|x(t) - x_\varepsilon(t)|^2\} < \varepsilon, \qquad -\infty < t < \infty;$$

we need only take as $x_\varepsilon(t)$ an appropriate Riemann-Stieltjes sum of the
spectral representation of $x(t)$. This approximation is the justification
for the following procedure commonly used by engineers and physicists
in examining stationary (wide sense) processes. They write a series as
in Example 3 to define $x(t)$ and then increase the number of $\lambda_j$'s and adjust
the corresponding $\sigma_j$'s to get the function

$$\sum_{\lambda_j \le \lambda}\sigma_j^2$$

to approximate the desired spectral distribution. This asymptotic
procedure is correct in that it approximates the spectral representation
stochastic integral by sums, but there is frequently no reason to use the

§6 THE LAW OF LARGE NUMBERS 529 


approximating sums rather than the integral in this connection any more 
than there is to replace integrals by approximating sums in other parts of 
mathematics. 


5. Spectral decompositions 

Let $\Lambda_1, \ldots, \Lambda_n$ be disjunct $\lambda$ sets, measurable with respect to $F$ measure,
whose union is the whole $\lambda$ axis. Then, as in the discrete parameter case,
a stationary (wide sense) process $\{x(t), -\infty < t < \infty\}$ with spectral
distribution function $F$ can be exhibited as the sum of mutually orthogonal
processes of the same type, whose spectral distributions are confined to
the respective sets $\Lambda_1, \ldots, \Lambda_n$. One example is obtained (as in X §5)
by the standard decomposition of the spectral distribution function $F$,

$$F = F_1 + F_2 + F_3,$$

where $F_1$ is the jump function of $F$, $F_2$ is the absolutely continuous
component, and $F_3$ is the continuous singular component. These three
monotone functions increase on disjunct sets (see X §5), and thus
correspond to a decomposition of the process,

$$x(t) = x_1(t) + x_2(t) + x_3(t).$$

If $\lambda_1, \lambda_2, \ldots$ are the points of discontinuity of the spectral distribution
function $F$, if $F$ is right continuous, and if $E\{|dy(\lambda)|^2\} = dF(\lambda)$,

$$x_1(t) = \sum_j e^{2\pi i t\lambda_j}[y(\lambda_j) - y(\lambda_j-)],$$

where the bracketed random variables form an orthogonal set. The series
converges in the mean for each fixed $t$. The covariance function is
uniformly almost periodic,

$$R_1(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dF_1(\lambda) = \sum_j e^{2\pi i t\lambda_j}[F(\lambda_j) - F(\lambda_j-)].$$


6. The law of large numbers for stationary (wide sense) processes 

The theorems of X §6 go over without any difficulty into the continuous
parameter case; it is only necessary to replace sums by integrals in the
statements of the theorems and in their proofs. The proofs will therefore
be omitted.

THEOREM 6.1 Let the $x(t)$ process be stationary (wide sense) with
spectral representation

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda), \qquad F(\lambda+) = F(\lambda).$$

Then

$$\operatorname*{l.i.m.}_{T\to\infty}\frac{1}{T}\int_0^T x(t)\,dt = y(0) - y(0-),$$

and

$$E\{|y(0) - y(0-)|^2\} = F(0) - F(0-) = \lim_{T\to\infty}\frac{1}{T}\int_0^T R(t)\,dt.$$

The limit will then be 0 if, for example, $\lim_{t\to\infty} R(t) = 0$ or if
$\int_0^{\infty} |R(t)|\,dt < \infty$.


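COROLLARY If $\mu$ is any real number,

$$\operatorname*{l.i.m.}_{T\to\infty}\frac{1}{T}\int_0^T x(t)e^{-2\pi i t\mu}\,dt = y(\mu) - y(\mu-)$$

and

$$E\{|y(\mu) - y(\mu-)|^2\} = F(\mu) - F(\mu-) = \lim_{T\to\infty}\frac{1}{T}\int_0^T R(t)e^{-2\pi i t\mu}\,dt.$$

As in the discrete parameter case, the one-sided averages above can be
replaced by two-sided averages: $\frac{1}{T}\int_0^T$ can be replaced by $\frac{1}{2T}\int_{-T}^{T}$ or even by
$\frac{1}{T'' - T'}\int_{T'}^{T''}$, $T'' - T' \to \infty$. The fact that one-sided averages are
admissible is, however, of fundamental importance (cf. X §6).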
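The second equation of the corollary says that averaging $R(t)e^{-2\pi i t\mu}$ over a long interval picks out the jump of $F$ at $\mu$. The sketch below illustrates this for a covariance function chosen by us as an assumption: $R(t) = \sigma_0^2 + \sigma_1^2\cos 2\pi\lambda_1 t$, whose spectral distribution (complex form) has jump $\sigma_0^2$ at 0 and jumps $\sigma_1^2/2$ at $\pm\lambda_1$, as in §3, Example 3. The parameter values and the midpoint quadrature are illustrative.

```python
import cmath
import math


def line_spectrum_covariance(t: float, sigma0_2: float = 1.0,
                             sigma1_2: float = 2.0, lam1: float = 0.3) -> float:
    """Assumed covariance R(t) = sigma0^2 + sigma1^2 * cos(2*pi*lam1*t)."""
    return sigma0_2 + sigma1_2 * math.cos(2.0 * math.pi * lam1 * t)


def averaged_covariance(mu: float, big_t: float, steps_per_unit: int = 100) -> complex:
    """Midpoint-rule value of (1/T) * integral_0^T R(t) exp(-2*pi*i*t*mu) dt."""
    n = int(big_t * steps_per_unit)
    h = big_t / n
    total = 0j
    for k in range(n):
        t = (k + 0.5) * h
        total += line_spectrum_covariance(t) * cmath.exp(-2j * math.pi * mu * t)
    return total * h / big_t
```

With the default parameters one would expect `averaged_covariance(0.0, 2000.0)` to be near 1 (the jump at 0), `averaged_covariance(0.3, 2000.0)` to be near 1 (half of $\sigma_1^2 = 2$), and the value at a frequency outside the spectrum, such as 0.1, to be near 0.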

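THEOREM 6.2 The limit in the corollary to Theorem 6.1 (in particular
when $\mu = 0$, the limit in the theorem) exists with probability 1, and is 0,
if there is a positive $K$ and a positive $\alpha$ such that the following equal
expressions are bounded as indicated.

$$E\left\{\left|\frac{1}{T}\int_0^T x(t)e^{-2\pi i\mu t}\,dt\right|^2\right\} = \frac{1}{T^2}\int_0^T\!\!\int_0^T R(t-s)e^{-2\pi i\mu(t-s)}\,ds\,dt$$

$$= \frac{1}{T}\int_{-T}^{T}\left(1 - \frac{|t|}{T}\right)R(t)e^{-2\pi i\mu t}\,dt = \int_{-\infty}^{\infty}\frac{\sin^2\pi T(\lambda - \mu)}{\pi^2 T^2(\lambda - \mu)^2}\,dF(\lambda) \le \frac{K}{T^\alpha}.$$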

$7 THE ESTIMATION OF R(t) AND F(A) 531 
The condition of the theorem is satisfied for all u, with «= 1, if 
J [R| dt < 00; it is also satisfied for all x if R(t) < const. |e for 
0 


some x > 0. 
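The equality of the time-domain and spectral expressions in Theorem 6.2 rests on a Fejér-type kernel identity. As a hedged numerical check, the sketch below takes a spectrum consisting of a single unit mass at distance `nu` from $\mu$ (an illustrative assumption) and compares a midpoint-rule evaluation of the weighted time integral with the closed-form kernel. The function names are hypothetical.

```python
import math

def fejer_time_side(nu, T, n=20000):
    """Midpoint rule for (1/T) * int_{-T}^{T} (1 - |t|/T) cos(2*pi*t*nu) dt.

    The sine part of exp(2*pi*i*t*nu) integrates to zero by symmetry.
    """
    h = 2 * T / n
    s = 0.0
    for k in range(n):
        t = -T + (k + 0.5) * h
        s += (1 - abs(t) / T) * math.cos(2 * math.pi * t * nu)
    return s * h / T

def fejer_spectral_side(nu, T):
    """sin^2(pi*T*nu) / (pi*T*nu)^2, with the limiting value 1 at nu = 0."""
    a = math.pi * T * nu
    return 1.0 if a == 0 else (math.sin(a) / a) ** 2

print(fejer_time_side(0.37, 5.0), fejer_spectral_side(0.37, 5.0))
```

For a fixed nonzero `nu` the kernel is at most $1/(\pi^2 T^2 \nu^2)$, which is a bound of the form $K/T^{\alpha}$ in this special case.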


7. The estimation of R(t) and F(λ) from sample functions


Theorems 7.1 and 7.2 of X go directly over to the continuous parameter 
case (replacing averages by integral averages), and the continuous para- 
meter statements will be omitted. 

It is important to note that $R(t)$ and $F(\lambda)$ cannot in general be determined
from a knowledge of the sample functions in a finite interval. In fact,
suppose that the stochastic process were known completely in the interval
$|t| < T$; this of course is more than could be known from sampling.
Suppose even that the process is known to be Gaussian, with $E\{x(t)\} = 0$.
Then $R(t)$ would be known for $|t| < 2T$, but $R(t)$ could not be determined
for all $t$, in general, and therefore the spectral distribution could not be
determined (in both cases we mean with complete accuracy of course)
because examples of pairs of positive definite functions, that is, covariance
functions, have been given which are identical in an interval containing
$t = 0$ but not for all $t$.

Theorem 7.3 of X goes over into the continuous parameter case with
no difficulty, and shows how to estimate the spectral distribution in
practice:

$$\lim_{t\to\infty} \int_{\mu_1}^{\mu_2} \frac{1}{t}\Big|\int_0^t e^{-2\pi i\lambda s}\,x(s)\,ds\Big|^2 d\lambda = F(\mu_2) - F(\mu_1), \tag{7.1}$$

if $\mu_1$ and $\mu_2$ are continuity points of the spectral distribution function $F$,
but again the differentiated version

$$\lim_{t\to\infty} \frac{1}{t}\Big|\int_0^t e^{-2\pi i\lambda s}\,x(s)\,ds\Big|^2 = F'(\lambda)$$

is incorrect even if $F$ is absolutely continuous. The limit in (7.1) is a
limit in probability for each pair $\mu_1, \mu_2$, and is even a limit with probability
1 if for each $t$,

$$\lim_{T\to\infty} \frac{1}{T}\int_0^T x(s+t)\,\overline{x(s)}\,ds = R(t),$$

with probability 1.




8. Absolutely continuous spectral distributions and moving averages 
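The spectral representation

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda) \tag{8.1}$$

can be replaced by

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda} f(\lambda)\,d\tilde y(\lambda), \qquad E\{|d\tilde y(\lambda)|^2\} = d\lambda, \tag{8.2}$$

if $F$ is absolutely continuous and if $|f|^2 = F'$, just as in the discrete parameter case, and the proof will be omitted. If $E\{x(t)\} = 0$, $E\{dy(\lambda)\} = 0$, and $\tilde y(\lambda)$ can be chosen in such a way that $E\{d\tilde y(\lambda)\} = 0$.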

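In the continuous parameter case a process of moving averages is
defined as a process given by an expression of the form

$$x(t) = \int_{-\infty}^{\infty} C^*(\lambda)\,d\xi(\lambda + t) = \int_{-\infty}^{\infty} C^*(\lambda - t)\,d\xi(\lambda),$$

where $C^*$ is a Lebesgue measurable function,

$$\int_{-\infty}^{\infty} |C^*(\lambda)|^2\,d\lambda < \infty,$$

and the $\xi(\lambda)$ process has orthogonal increments, with $E\{|d\xi(\lambda)|^2\} = d\lambda$.
With this definition,

$$E\{x(s+t)\,\overline{x(s)}\} = \int_{-\infty}^{\infty} C^*(\lambda - t)\,\overline{C^*(\lambda)}\,d\lambda = \int_{-\infty}^{\infty} |C(\lambda)|^2 e^{2\pi i t\lambda}\,d\lambda,$$

where $C$ is the Fourier transform of $C^*$,

$$C(\lambda) = \int_{-\infty}^{\infty} e^{2\pi i\lambda\mu}\,C^*(\mu)\,d\mu \quad \Big(= \operatorname*{l.i.m.}_{A\to\infty} \int_{-A}^{A} e^{2\pi i\lambda\mu}\,C^*(\mu)\,d\mu\Big).$$

Hence the x(t) process is stationary in the wide sense, with

$$R(t) = \int_{-\infty}^{\infty} |C(\lambda)|^2 e^{2\pi i t\lambda}\,d\lambda;$$

the spectral distribution is absolutely continuous, with density $|C|^2$.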
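Conversely suppose that an x(t) process has an absolutely continuous
spectral distribution. We can then write x(t) in the form (8.2), with
$\int_{-\infty}^{\infty} |f(\lambda)|^2\,d\lambda < \infty$; it will be helpful to change the notation, and write
instead

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda} f(\lambda)\,d\xi^*(\lambda), \qquad E\{|d\xi^*(\lambda)|^2\} = d\lambda. \tag{8.2}$$

We can write $f$ as a Fourier transform,

$$f(\lambda) = \int_{-\infty}^{\infty} e^{2\pi i\lambda\mu} f^*(\mu)\,d\mu.$$

If (8.2) is written symbolically as

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda} f(\lambda)\,\xi^{*\prime}(\lambda)\,d\lambda \tag{8.2'}$$

and if a $\xi(\lambda)$ process is defined as the Fourier transform of the $\xi^*(\lambda)$
process, so that, symbolically (cf. IX §4)

$$\xi'(\lambda) = \int_{-\infty}^{\infty} e^{2\pi i\lambda\mu}\,\xi^{*\prime}(\mu)\,d\mu,$$

Parseval's identity applied formally to (8.2') yields

$$x(t) = \int_{-\infty}^{\infty} f^*(\mu)\,\xi'(t + \mu)\,d\mu = \int_{-\infty}^{\infty} f^*(\mu - t)\,d\xi(\mu).$$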


This equation is correct (ignoring the middle term) because the Fourier 
transform & process was defined in IX §4 precisely to make these formal 
operations correct. Thus we have proved that a stationary (wide sense) 
process is a process of moving averages if and only if its spectral distribution 
is absolutely continuous. 

As an example consider the stationary (wide sense) Markov (wide sense)
processes, discussed in §3, Example 2. The spectral distribution of such
a process is absolutely continuous, with

$$R(t) = e^{-ct}R(0) \quad (t \ge 0), \qquad F'(\lambda) = \frac{2\alpha R(0)}{|c + 2\pi i\lambda|^2},$$

where $c$ has positive real part $\alpha$ (we exclude the degenerate case $\alpha = 0$).
In this case it is easily verified that we can take

$$C^*(t) = \begin{cases} [2\alpha R(0)]^{1/2}\,e^{ct}, & t \le 0, \\ 0, & t > 0, \end{cases} \qquad C(\lambda) = \frac{[2\alpha R(0)]^{1/2}}{c + 2\pi i\lambda},$$

so that x(t) can be put in the form

$$x(t) = [2\alpha R(0)]^{1/2} \int_0^{\infty} e^{-c\lambda}\,d\xi(t - \lambda).$$

It is easy to derive the (wide sense) Markov property from this representation of x(t) in terms of the past of a process with orthogonal increments.
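The spectral density quoted for the Markov example can be checked numerically. The sketch below, under the assumption that $F'(\lambda) = \int e^{-2\pi i\lambda t}R(t)\,dt$ with $R(-t) = \overline{R(t)}$, integrates the covariance $R(t) = e^{-ct}R(0)$ by the midpoint rule and compares the result with $2\alpha R(0)/|c + 2\pi i\lambda|^2$. The parameter values and function names are illustrative only.

```python
import cmath

def density_from_covariance(c, R0, lam, t_max=40.0, n=100000):
    """2 * Re int_0^t_max R0 * exp(-(c + 2*pi*i*lam) * t) dt by the midpoint rule.

    The half line t < 0 contributes the complex conjugate of the t > 0 part,
    because R(-t) is the conjugate of R(t).
    """
    h = t_max / n
    s = 0j
    for k in range(n):
        t = (k + 0.5) * h
        s += R0 * cmath.exp(-(c + 2j * cmath.pi * lam) * t)
    return 2 * (s * h).real

def density_closed_form(c, R0, lam):
    """2 * alpha * R(0) / |c + 2*pi*i*lam|^2, where alpha is the real part of c."""
    return 2 * c.real * R0 / abs(c + 2j * cmath.pi * lam) ** 2

c, R0 = complex(1.0, 0.5), 2.0
print(density_from_covariance(c, R0, 0.3), density_closed_form(c, R0, 0.3))
```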


9. Linear operations on stationary processes 
Let $\{x(t)\}$ be the variables of a stationary (wide sense) process with
spectral representation

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda). \tag{9.1}$$

By a linear operation on the x(t) process we mean a transformation taking
the $x(t)$ process into an $\tilde x(t)$ process, where $\tilde x(t)$ is either a finite sum of
the form

$$\tilde x(t) = \sum_j C_j\,x(t + t_j) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\Big[\sum_j C_j e^{2\pi i t_j\lambda}\Big]\,dy(\lambda)$$

or a limit in the mean of such finite sums. Since mean limits on the left
correspond to mean limits of the bracket [with $\lambda$ weighting $dF(\lambda)$], the
most general $\tilde x(t)$ process is given by

$$\tilde x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,C(\lambda)\,dy(\lambda).$$

Here $C(\lambda)$ is any limit in the mean of finite sums $\sum_j C_j e^{2\pi i t_j\lambda}$ and as such
may be any function which is measurable with respect to $F$ and for which
$\int_{-\infty}^{\infty} |C(\lambda)|^2\,dF(\lambda) < \infty$.
The function $C$ is called the gain of the operation. Thus every linear
operation has a gain, and every gain determines a linear operation which
defines a new stationary (wide sense) process. The new process satisfies
the continuity condition (3.1) because

$$E\{|\tilde x(t) - \tilde x(s)|^2\} = \int_{-\infty}^{\infty} |e^{2\pi i t\lambda} - e^{2\pi i s\lambda}|^2\,|C(\lambda)|^2\,dF(\lambda) \to 0, \qquad t - s \to 0.$$

The new covariance function is given by

$$\tilde R(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,|C(\lambda)|^2\,dF(\lambda),$$

so that the new spectral distribution function is given by

$$\tilde F(\lambda) = \int_{-\infty}^{\lambda} |C(\mu)|^2\,dF(\mu);$$

spectral intensities are multiplied by the factor $|C(\lambda)|^2$. If the gain is
identically 1, the linear operation is the identity.

If a linear operation with gain $C_1$ takes an $x(t)$ process into an $x_1(t)$
process, and if one with gain $C_2$ takes the $x_1(t)$ process into an $x_2(t)$
process, the one with gain $C_1C_2$ takes the $x(t)$ process into the $x_2(t)$ process,
and, if x(t) is given by (9.1),

$$x_2(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,C_1(\lambda)C_2(\lambda)\,dy(\lambda).$$

The only conditions (besides measurability) on $C_1$ and $C_2$ are that

$$\int_{-\infty}^{\infty} |C_1(\lambda)|^2\,dF(\lambda) < \infty, \qquad \int_{-\infty}^{\infty} |C_1(\lambda)C_2(\lambda)|^2\,dF(\lambda) < \infty.$$

If in addition

$$\int_{-\infty}^{\infty} |C_2(\lambda)|^2\,dF(\lambda) < \infty,$$

the linear operations with gains $C_2$ and $C_1$ can be performed successively
on the $x(t)$ process and on the resulting process, obtaining again the
$x_2(t)$ process. Thus when several linear operations are performed successively
the result is a linear operation with gain the product of the individual
gains, and the operations are commutative in the sense that if the operations
can be performed in different orders the resulting process is unaffected
by changes in the order.

The sum of two linear operations is defined as the one with gain the
sum of the gains of the operations. It yields an $\tilde x(t)$ which is the sum
of the processes resulting from the operations.
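The rule that gains multiply, and that spectral intensities are multiplied by the squared modulus of the gain, can be sketched for a spectrum made up of finitely many point masses. The representation of $F$ as a dictionary of jumps, the two example gains, and all names below are assumptions made for illustration.

```python
import math

def apply_gain(masses, gain):
    """Return the spectral jumps after a linear operation with the given gain.

    `masses` maps a frequency to its jump of F; each jump is multiplied
    by |gain(frequency)|**2.
    """
    return {lam: abs(gain(lam)) ** 2 * p for lam, p in masses.items()}

def differentiation_gain(lam):
    """Gain 2*pi*i*lam."""
    return 2j * math.pi * lam

def window_gain(lam, a=0.5):
    """Gain of averaging with weight 1/(2a) over [-a, a]: sin(x)/x, x = 2*pi*a*lam."""
    x = 2 * math.pi * a * lam
    return 1.0 if x == 0 else math.sin(x) / x

masses = {-0.7: 0.2, 0.0: 0.5, 0.4: 0.3}
one_order = apply_gain(apply_gain(masses, differentiation_gain), window_gain)
other_order = apply_gain(apply_gain(masses, window_gain), differentiation_gain)
product = apply_gain(masses, lambda lam: differentiation_gain(lam) * window_gain(lam))
print(one_order)
print(product)
```

In this finite setting both orders of application agree with the single operation whose gain is the product, and the mass at frequency 0 is removed by the differentiation gain.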


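Example 1 Differentiation Suppose that $\int_{-\infty}^{\infty} \lambda^2\,dF(\lambda) < \infty$, and consider the gain $2\pi i\lambda$,

$$\tilde x(t) = \int_{-\infty}^{\infty} 2\pi i\lambda\,e^{2\pi i t\lambda}\,dy(\lambda).$$

Since

$$\frac{x(t+h) - x(t)}{h} = \int_{-\infty}^{\infty} \frac{e^{2\pi i(t+h)\lambda} - e^{2\pi i t\lambda}}{h}\,dy(\lambda),$$

the fact that [with $\lambda$ weighting $dF(\lambda)$]

$$\operatorname*{l.i.m.}_{h\to 0}\ \frac{e^{2\pi i(t+h)\lambda} - e^{2\pi i t\lambda}}{h} = 2\pi i\lambda\,e^{2\pi i t\lambda}$$

implies that

$$\operatorname*{l.i.m.}_{h\to 0}\ \frac{x(t+h) - x(t)}{h} = \int_{-\infty}^{\infty} 2\pi i\lambda\,e^{2\pi i t\lambda}\,dy(\lambda),$$

so that $\tilde x(t)$ is a mean square derivative; in this extended sense the $x(t)$
process sample functions have derivatives. If now as usual $x(t, \omega)$ is the
value assumed by the random variable $x(t)$ at the point $\omega$, we shall
strengthen the foregoing result by proving that, if the x(t) process is
separable, almost all sample functions of the process are absolutely continuous,
and in fact that, if $x'(\cdot, \omega)$ denotes the derived function of the
sample function $x(\cdot, \omega)$, then, for each $t$,

$$x'(t, \omega) = \tilde x(t, \omega)$$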

with probability 1. Without the separability hypothesis we can only
prove the (equivalent) result that, if $R$ is any denumerable parameter set,
almost all sample functions coincide on $R$ with functions defined for all $t$,
which are absolutely continuous, and whose derived functions satisfy the
above relation. For given $t$ the random variable $\tilde x(t)$ is uniquely defined
only up to values on $\omega$ sets of probability 0. We have seen in §3 that we
can therefore suppose that this random variable is defined in such a way
that the $\tilde x(t)$ process is measurable, and that almost all sample functions
of the $\tilde x(t)$ process are Lebesgue integrable over finite intervals. Then

$$\int_0^t \tilde x(s, \omega)\,ds$$

defines an absolutely continuous function of $t$, for almost all $\omega$, and each
such $t$ function has derivative $\tilde x(t, \omega)$ for almost all $t$. Moreover, dropping
$\omega$ from the notation, as usual,

$$\int_0^t \tilde x(s)\,ds = \int_0^t ds \int_{-\infty}^{\infty} 2\pi i\lambda\,e^{2\pi i s\lambda}\,dy(\lambda) = \int_{-\infty}^{\infty} (e^{2\pi i t\lambda} - 1)\,dy(\lambda) = x(t) - x(0),$$


with probability 1, since (IX §2) the order of integration can be reversed
in the iterated integral. This equation is to be interpreted as follows: for
each $t$ the left and right sides are equal, with probability 1. Then the two
sides are equal with probability 1 simultaneously for all values of $t$ in
any prescribed denumerable set $S$. If $S$ is chosen, as can be done since
the $x(t)$ process is separable by hypothesis, so that the upper and lower
limits of almost all $x(t)$ process sample functions coincide on open
intervals with these limits for $t$ restricted to $S$ in these intervals, it follows
that almost every $x(t)$ process sample function $x(\cdot, \omega)$ is absolutely continuous,
with derived function $\tilde x(\cdot, \omega)$, as was to be proved.
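The mean square convergence of the difference quotient to the gain $2\pi i\lambda$ can be made concrete when $F$ is concentrated on finitely many frequencies, an assumption made here purely for illustration. Since $|e^{2\pi i t\lambda}| = 1$, the mean square error does not depend on $t$ and reduces to the weighted sum computed below; the function name and the example masses are hypothetical.

```python
import cmath
import math

def mean_square_error(masses, h):
    """sum_j p_j * |(exp(2*pi*i*h*lam_j) - 1)/h - 2*pi*i*lam_j|**2.

    This is the dF-weighted squared distance between the difference
    quotient of exp(2*pi*i*t*lam) and its pointwise limit.
    """
    total = 0.0
    for lam, p in masses.items():
        quotient = (cmath.exp(2j * math.pi * h * lam) - 1) / h
        total += p * abs(quotient - 2j * math.pi * lam) ** 2
    return total

masses = {0.5: 0.4, -3.0: 0.6}
for h in (1e-2, 1e-3, 1e-4, 1e-5):
    print(h, mean_square_error(masses, h))
```

For small `h` each term behaves like $p_j\,(2\pi^2 h\lambda_j^2)^2$, so high frequencies dominate the error, in line with the role of the condition $\int \lambda^2\,dF(\lambda) < \infty$.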

We have shown that the operation with gain $2\pi i\lambda$ corresponds to mean
square and ordinary differentiation. Conversely, suppose that mean
square differentiation is possible; that is, we suppose that

$$\operatorname*{l.i.m.}_{h\to 0}\ [x(t + h) - x(t)]/h$$

exists; in $\lambda$ language this means that we suppose that [using $\lambda$ weighting $dF(\lambda)$]

$$\operatorname*{l.i.m.}_{h\to 0}\ \frac{e^{2\pi i(t+h)\lambda} - e^{2\pi i t\lambda}}{h}$$

exists. The limit must be $2\pi i\lambda\,e^{2\pi i t\lambda}$ since where both mean and ordinary
limits exist they must agree, and it follows that

$$\int_{-\infty}^{\infty} |2\pi i\lambda\,e^{2\pi i t\lambda}|^2\,dF(\lambda) = 4\pi^2 \int_{-\infty}^{\infty} \lambda^2\,dF(\lambda) < \infty.$$

Thus we have shown that the operation of mean square (which implies
ordinary) differentiation is possible if and only if $\int_{-\infty}^{\infty} \lambda^2\,dF(\lambda) < \infty$, that is,
if and only if high frequencies are not too prevalent. Note that the
existence of ordinary derivatives does not necessarily imply that of mean
square derivatives. To see this consider the following example. Let the
$y(t)$ process be a separable Poisson process with parameter $c > 0$ and let
$x(t) = y(t + 1) - y(t) - c$. The $x(t)$ process is strictly stationary, as well
as stationary in the wide sense; the spectral distribution is absolutely
continuous with density

$$\frac{c\,(1 - \cos 2\pi\lambda)}{2\pi^2\lambda^2}.$$

Since

$$\int_{-\infty}^{\infty} \lambda^2 F'(\lambda)\,d\lambda = \infty,$$

the mean square derivative does not exist. On the other hand, $x'(t, \omega)$
exists and is 0 for each $\omega$, except at a denumerable $t$ set, considering only
those sample functions, whose aggregate has probability 1, which change
only in unit jumps.
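The failure of the second moment condition in the Poisson example can be seen numerically. The sketch below assumes the density $c\,(1 - \cos 2\pi\lambda)/(2\pi^2\lambda^2)$ and evaluates the truncated second moment over $[-L, L]$ by the midpoint rule; for integer $L$ the exact value is $cL/\pi^2$, which grows without bound. The helper names and the value of `c` are illustrative.

```python
import math

def density(lam, c):
    """c * (1 - cos(2*pi*lam)) / (2 * pi**2 * lam**2), with limiting value c at lam = 0."""
    if lam == 0:
        return c
    return c * (1 - math.cos(2 * math.pi * lam)) / (2 * math.pi ** 2 * lam ** 2)

def second_moment(L, c, n_per_unit=1000):
    """Midpoint rule for int_{-L}^{L} lam**2 * density(lam, c) dlam."""
    n = int(2 * L * n_per_unit)
    h = 2 * L / n
    total = 0.0
    for k in range(n):
        lam = -L + (k + 0.5) * h
        total += lam ** 2 * density(lam, c)
    return total * h

for L in (10, 20, 40):
    print(L, second_moment(L, 2.0), 2.0 * L / math.pi ** 2)
```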

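Example 2 Integration Suppose the gain $C$ can be expressed as the
Fourier transform of an integrable function,

$$C(\lambda) = \int_{-\infty}^{\infty} e^{2\pi i\lambda\mu}\,C^*(\mu)\,d\mu, \qquad \int_{-\infty}^{\infty} |C^*(\mu)|\,d\mu < \infty.$$

The corresponding linear operation can always be performed, since $C$ is
bounded and continuous. This operation can be identified with an
integral averaging,

$$\begin{aligned}
\tilde x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,C(\lambda)\,dy(\lambda)
&= \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy(\lambda) \int_{-\infty}^{\infty} e^{2\pi i\lambda\mu}\,C^*(\mu)\,d\mu \\
&= \int_{-\infty}^{\infty} C^*(\mu)\,d\mu \int_{-\infty}^{\infty} e^{2\pi i(t+\mu)\lambda}\,dy(\lambda) \\
&= \int_{-\infty}^{\infty} C^*(\mu)\,x(t + \mu)\,d\mu.
\end{aligned}$$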


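(The interchange of order of integration is justified by the discussion in
IX §2.) Conversely suppose that one begins with an integral average

$$\int_{-\infty}^{\infty} C^*(\mu)\,x(t + \mu)\,d\mu,$$

with as yet unspecified conditions on the averaging function $C^*$. The
natural condition to impose is the absolute convergence of the double
integral

$$E\Big\{\int_{-\infty}^{\infty} |C^*(\mu)\,x(t + \mu)|\,d\mu\Big\}$$

and this condition leads back to the preceding condition that
$\int_{-\infty}^{\infty} |C^*(\mu)|\,d\mu < \infty$ by way of the inequality

$$\begin{aligned}
E\Big\{\int_{-\infty}^{\infty} |C^*(\mu)\,x(t + \mu)|\,d\mu\Big\}
&= \int_{-\infty}^{\infty} |C^*(\mu)|\,E\{|x(t + \mu)|\}\,d\mu \\
&\le \int_{-\infty}^{\infty} |C^*(\mu)|\,E^{1/2}\{|x(t + \mu)|^2\}\,d\mu \\
&= E^{1/2}\{|x(0)|^2\} \int_{-\infty}^{\infty} |C^*(\mu)|\,d\mu.
\end{aligned}$$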


In the following therefore, in performing averaging, we shall always
assume that the averaging function $C^*$ is absolutely integrable over
$(-\infty, \infty)$. The gain will then be its Fourier transform $C$, so that spectral
intensities will be multiplied by $|C|^2$. It is instructive to consider the
following degenerate case: let a $\xi(t)$ process have orthogonal increments
with $E\{|d\xi(t)|^2\} = \sigma^2\,dt$. Then formally $\xi'(t)$ is, as we have seen, a
process with constant spectral density $\sigma^2$. According to this, the integral
average

$$x(t) = \int_{-\infty}^{\infty} C^*(\mu)\,\xi'(t + \mu)\,d\mu = \int_{-\infty}^{\infty} C^*(\mu)\,d\xi(t + \mu)$$

should have spectral density $|C|^2\sigma^2$; as a matter of fact, the x(t) process
is a process of moving averages with this spectral density (cf. §8); in this
case the appropriate condition on $C^*$ is that

$$\int_{-\infty}^{\infty} |C^*(\mu)|^2\,d\mu < \infty.$$
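If several operations of integral averaging are performed successively,
the result is a linear operation which is also an integral average, and the
averaging functions combine by convolution. We prove this for two
operations. Suppose, then, that $C_1^*$, $C_2^*$ are two averaging functions;
we must prove that, if $C^*$ is defined by

$$C^*(\mu) = \int_{-\infty}^{\infty} C_1^*(\mu - \lambda)\,C_2^*(\lambda)\,d\lambda,$$

then $C^*$ is the averaging function corresponding to the repeated operation,
that is, $\int_{-\infty}^{\infty} |C^*(\mu)|\,d\mu < \infty$ and $C^*$ is an averaging function with gain
(Fourier transform) $C_1C_2$. On the first point,

$$\int_{-\infty}^{\infty} |C^*(\mu)|\,d\mu \le \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} |C_1^*(\mu - \lambda)\,C_2^*(\lambda)|\,d\lambda\,d\mu = \int_{-\infty}^{\infty} |C_1^*(\mu)|\,d\mu \int_{-\infty}^{\infty} |C_2^*(\lambda)|\,d\lambda < \infty.$$

On the second,

$$\begin{aligned}
\int_{-\infty}^{\infty} e^{2\pi i\nu\mu}\,C^*(\mu)\,d\mu
&= \int_{-\infty}^{\infty} e^{2\pi i\nu\mu}\,d\mu \int_{-\infty}^{\infty} C_1^*(\mu - \lambda)\,C_2^*(\lambda)\,d\lambda \\
&= \int_{-\infty}^{\infty} e^{2\pi i\nu\lambda}\,C_2^*(\lambda)\,d\lambda \int_{-\infty}^{\infty} e^{2\pi i\nu(\mu-\lambda)}\,C_1^*(\mu - \lambda)\,d(\mu - \lambda) \\
&= C_1(\nu)\,C_2(\nu).
\end{aligned}$$

We have of course here simply verified the well-known fact that convolution
of functions corresponds to multiplication of their Fourier transforms.
The various interchanges of orders of integration were all justified by the
absolute convergence of the integrals involved.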
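The correspondence between convolution of averaging functions and multiplication of gains can be checked on a pair of one-sided exponentials, chosen here only because their convolution is known in closed form. The sketch uses the transform convention $\int e^{2\pi i\nu\mu}\,C^*(\mu)\,d\mu$ and a midpoint rule on a truncated interval; the names and parameter values are assumptions.

```python
import cmath
import math

def transform(f, nu, lo=0.0, hi=40.0, n=40000):
    """Midpoint-rule approximation of int_lo^hi exp(2*pi*i*nu*u) * f(u) du."""
    h = (hi - lo) / n
    total = 0j
    for k in range(n):
        u = lo + (k + 0.5) * h
        total += cmath.exp(2j * math.pi * nu * u) * f(u)
    return total * h

a, b = 1.0, 3.0

def c1(u):
    """First averaging function: exp(-a*u) on u >= 0, zero elsewhere."""
    return math.exp(-a * u) if u >= 0 else 0.0

def c2(u):
    """Second averaging function: exp(-b*u) on u >= 0, zero elsewhere."""
    return math.exp(-b * u) if u >= 0 else 0.0

def conv(u):
    """Convolution of c1 and c2 in closed form: (exp(-a*u) - exp(-b*u)) / (b - a)."""
    return (math.exp(-a * u) - math.exp(-b * u)) / (b - a) if u >= 0 else 0.0

nu = 0.4
print(transform(conv, nu))
print(transform(c1, nu) * transform(c2, nu))
```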

We now consider gains of the form $2\pi i\lambda C$, where $C$ is, as above, the
Fourier transform of an integrable $C^*$, and (the necessary condition for
a gain)

$$4\pi^2 \int_{-\infty}^{\infty} \lambda^2\,|C(\lambda)|^2\,dF(\lambda) < \infty \tag{9.2}$$

is satisfied. Under these conditions the linear operation can be considered
the result of two successive operations: integral averaging with
averaging function $C^*$ followed by differentiation. The condition (9.2)
legitimizes the differentiation. Since we have not supposed that
$\int_{-\infty}^{\infty} \lambda^2\,dF(\lambda) < \infty$, it may not be possible to reverse the order of the two
operations. However, we shall show that the result of the successive
operations can always be written in the form

$$\tilde x(t) = \int_{-\infty}^{\infty} C^*(\mu)\,dx(t + \mu), \tag{9.3}$$

which reduces to the result of differentiation and integral averaging,

$$\tilde x(t) = \int_{-\infty}^{\infty} C^*(\mu)\,x'(t + \mu)\,d\mu,$$

if $\int_{-\infty}^{\infty} \lambda^2\,dF(\lambda) < \infty$, so that the $x'(t)$ process exists. We put this problem
in a more general setting by discussing in detail stochastic integrals of
the form

$$\int_{-\infty}^{\infty} f^*(t)\,dx(t)$$

for $x(t)$ processes which are stationary in the wide sense. If the spectral
representation of $x(t)$ is

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda),$$

and, if we observe our usual convention relating starred and unstarred
functions,

$$f(\lambda) = \int_{-\infty}^{\infty} e^{2\pi i\lambda s} f^*(s)\,ds, \tag{9.4}$$

then formally

$$\int_{-\infty}^{\infty} f^*(t)\,dx(t) = 2\pi i \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} \lambda\,e^{2\pi i t\lambda} f^*(t)\,dt\,dy(\lambda) = 2\pi i \int_{-\infty}^{\infty} \lambda f(\lambda)\,dy(\lambda). \tag{9.5}$$


Jo 
Now integrals like the last one have been defined in IX §2. The condition
imposed in that section becomes

$$\int_{-\infty}^{\infty} \lambda^2\,|f(\lambda)|^2\,dF(\lambda) < \infty \tag{9.6}$$

here. We define the left side of (9.5) as the last integral in (9.5) for every
function $f^*$ whose corresponding Fourier transform $f$ satisfies (9.6). The
method of defining $f$ must be unique up to a $\lambda$ set over which $\int \lambda^2\,dF(\lambda)$
vanishes (or the stochastic integral would not be uniquely defined), and
it should be linear in the sense that $af^* + bg^*$ should correspond to
$af + bg$ (or the stochastic integral would not be linear in the integrand).
Finally $f$ should be the usual Fourier transform if $f^*$ is, say, absolutely
integrable over $(-\infty, \infty)$. It will be sufficient below if we restrict ourselves
to functions $f^*$ which are integrable over all finite intervals, for which

$$\lim_{A\to\infty} \int_{-A}^{A} e^{2\pi i t\lambda} f^*(t)\,dt$$

exists for all $\lambda$, defining the transform $f$ which satisfies (9.6). This class
is certainly admissible in accordance with the principles just stated.
According to the definition (9.5)

$$E\Big\{\int_{-\infty}^{\infty} f^*(t)\,dx(t)\ \overline{\int_{-\infty}^{\infty} g^*(t)\,dx(t)}\Big\} = 4\pi^2 \int_{-\infty}^{\infty} \lambda^2 f(\lambda)\,\overline{g(\lambda)}\,dF(\lambda). \tag{9.7}$$

In particular, if $\int_{-\infty}^{\infty} \lambda^2\,dF(\lambda) < \infty$, the derived $x'(t)$ process exists,

$$x'(t) = 2\pi i \int_{-\infty}^{\infty} \lambda\,e^{2\pi i t\lambda}\,dy(\lambda),$$

and (combining the present definition with the discussion of Example 2)

$$\int_{-\infty}^{\infty} f^*(t)\,dx(t) = \int_{-\infty}^{\infty} f^*(t)\,x'(t)\,dt, \qquad \text{if} \quad \int_{-\infty}^{\infty} |f^*(t)|\,dt < \infty.$$


Using the preceding results, we find that, as stated above, if $C^*$ is
absolutely integrable over $(-\infty, \infty)$ and if $C$ satisfies (9.6),

$$\int_{-\infty}^{\infty} C^*(\mu)\,dx(t + \mu) = \int_{-\infty}^{\infty} C^*(\mu - t)\,dx(\mu) = 2\pi i \int_{-\infty}^{\infty} \lambda\,C(\lambda)\,e^{2\pi i t\lambda}\,dy(\lambda),$$

that is, the operation with gain $2\pi i\lambda C$ can be written in the form (9.3).


10. Rational spectral densities 
In this section we shall treat stationary (wide sense) processes with
absolutely continuous spectral distributions having rational spectral
densities $F'(\lambda) \not\equiv 0$,

$$F'(\lambda) = c\,\frac{\prod_j (\lambda - w_j)}{\prod_j (\lambda - z_j)}.$$

The numerator and denominator can be supposed to have no common
roots. The reality of $F'$ means that

$$c\,\frac{\prod_j (\lambda - w_j)}{\prod_j (\lambda - z_j)} = \bar c\,\frac{\prod_j (\lambda - \bar w_j)}{\prod_j (\lambda - \bar z_j)},$$

so that the imaginary $z_j$'s and the $w_j$'s must be conjugate in pairs. Moreover,
since $F'$ is integrable, no $z_j$ can be real and, since $F' \ge 0$, every real
$w_j$ must be a root of even multiplicity and $c = \bar c > 0$. Hence, using the
fact that $|\lambda - \xi| = |\lambda - \bar\xi|$, we can write $F'$ in the form

$$F'(\lambda) = c\,\frac{\big|\prod_{j=1}^{\alpha} (\lambda - w_j)\big|^2}{\big|\prod_{j=1}^{\beta} (\lambda - z_j)\big|^2} = \frac{\big|\sum_{j=0}^{\alpha} A_j\lambda^j\big|^2}{\big|\sum_{j=0}^{\beta} B_j\lambda^j\big|^2}, \qquad A_\alpha B_0 B_\beta \ne 0, \quad \beta > \alpha. \tag{10.1}$$

Here $\beta > \alpha$ because $F'$ is integrable, and we can suppose, if convenient,
that the roots of the denominator and the imaginary roots of the
numerator have positive imaginary part. We shall always suppose that
numerator and denominator have no common root. If the roots are
chosen as described, the $A_j$'s and $B_j$'s are uniquely determined up to a
proportionality factor.


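In particular, if the x(t) process is real, $F'$ is even, and we can write
(10.1) in the form

$$F'(\lambda) = \frac{\big|\sum_{j=0}^{\alpha} A_j'\,(i\lambda)^j\big|^2}{\big|\sum_{j=0}^{\beta} B_j'\,(i\lambda)^j\big|^2}, \qquad A_j' = A_j i^{-j}, \quad B_j' = B_j i^{-j}, \tag{10.1'}$$

where the $A_j'$'s and $B_j'$'s are real.
According to §8 the spectral representation of the process can be
written in the form

$$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,\frac{\sum_{j=0}^{\alpha} A_j\lambda^j}{\sum_{j=0}^{\beta} B_j\lambda^j}\,dz^*(\lambda), \qquad E\{|dz^*(\lambda)|^2\} = d\lambda, \tag{10.2}$$

where the $z^*(\lambda)$ process has orthogonal increments. The covariance
function is given by

$$R(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,\frac{\big|\sum_0^{\alpha} A_j\lambda^j\big|^2}{\big|\sum_0^{\beta} B_j\lambda^j\big|^2}\,d\lambda = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,\frac{\sum_0^{\alpha} A_j\lambda^j\ \sum_0^{\alpha} \bar A_j\lambda^j}{\sum_0^{\beta} B_j\lambda^j\ \sum_0^{\beta} \bar B_j\lambda^j}\,d\lambda. \tag{10.3}$$

According to a standard residue argument, $R(t)/(2\pi i)$ is the sum of the
residues of the last integrand in the upper half-plane, if $t \ge 0$. Moreover,
if $L$ is a simple closed rectifiable curve in the upper half-plane, which
contains in its interior the zeros of the denominator in the upper half-plane,
then $R^{(k)}(t)$ is given by

$$R^{(k)}(t) = \int_L e^{2\pi i t\lambda}\,(2\pi i\lambda)^k\,\frac{\sum_0^{\alpha} A_j\lambda^j\ \sum_0^{\alpha} \bar A_j\lambda^j}{\sum_0^{\beta} B_j\lambda^j\ \sum_0^{\beta} \bar B_j\lambda^j}\,d\lambda, \qquad t \ge 0, \quad k \ge 0. \tag{10.4}$$

Here $R^{(k)}(0)$ is to be interpreted as the one-sided derivative $R^{(k)}(0+)$.
Hence $R^{(k)}(t)/(2\pi i)$ is the sum of the residues of the integrand in (10.4)
in the upper half-plane. The function $R$ is thus indefinitely differentiable
for $t > 0$, and also for $t < 0$, since $R(-t) = \overline{R(t)}$. The residue evaluation
of $R(t)$ yields, if we suppose that the zeros of $\sum_0^{\beta} B_j\lambda^j$ have positive
imaginary parts,

$$R(t) = \sum_j C_j(t)\,e^{2\pi i z_j t}, \qquad t \ge 0, \qquad [R(-t) = \overline{R(t)}] \tag{10.5}$$

where $C_j$ is a polynomial in $t$, and the $z_j$'s are zeros of $\sum_0^{\beta} B_j\lambda^j$. If $R$ is real,
we can write

$$R(t) = \sum_j (C_j' \cos 2\pi a_j t + C_j'' \sin 2\pi a_j t)\,e^{-2\pi b_j t}$$

where $C_j'$, $C_j''$ are real polynomials in $t$, and $a_j$, $b_j$ are respectively the real
and imaginary parts of $z_j$.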

Thus $R$ has one-sided derivatives of all orders at 0, and $R^{(k)}(0+)/(2\pi i)$
is the sum of the residues of the integrand in (10.4) (with $t = 0$) in the
upper half-plane. Similarly, $-R^{(k)}(0-)/(2\pi i)$ is the sum of the residues
of the same function in the lower half-plane. Hence the quantity

$$\frac{R^{(k)}(0+) - R^{(k)}(0-)}{2\pi i}$$

is the sum of all the residues of this function in the finite plane, that is,
the coefficient of $1/\lambda$ in the power series expansion of the function in a
neighborhood of $\infty$. Hence

$$R^{(k)}(0+) - R^{(k)}(0-) = \begin{cases} 0, & k \le 2\beta - 2\alpha - 2, \\[4pt] \dfrac{|A_\alpha|^2}{|B_\beta|^2}\,(2\pi i)^{k+1}, & k = 2\beta - 2\alpha - 1. \end{cases} \tag{10.6}$$

The first line here is also an obvious consequence of (10.3), since differentiation
under the sign of integration in (10.3) is legitimate for the values of
$k$ involved. Note that, according to our evaluation of $R(t)$, $R(t) \to 0$
exponentially when $|t| \to \infty$.
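The boundary condition (10.6) can be checked in the simplest case $\beta = 1$, $\alpha = 0$, using the Markov covariance of §8, for which $|A_0|^2 = 2\,\mathrm{Re}(c)\,R(0)$, $B_0 = c$ and $B_1 = 2\pi i$. The sketch below is a finite-difference illustration under these assumptions, not a general verification; the function names are hypothetical.

```python
import cmath
import math

def R(t, c, R0):
    """Covariance R(t) = R0 * exp(-c*t) for t >= 0, extended by R(-t) = conj(R(t))."""
    if t >= 0:
        return R0 * cmath.exp(-c * t)
    return R0 * cmath.exp(c.conjugate() * t)

def derivative_jump(c, R0, h=1e-6):
    """Finite-difference estimate of R'(0+) - R'(0-)."""
    right = (R(h, c, R0) - R(0.0, c, R0)) / h
    left = (R(0.0, c, R0) - R(-h, c, R0)) / h
    return right - left

def predicted_jump(c, R0):
    """(|A_0|^2 / |B_1|^2) * (2*pi*i)**2, the second line of (10.6) with k = 1."""
    A0_sq = 2 * c.real * R0
    B1_sq = abs(2j * math.pi) ** 2
    return A0_sq / B1_sq * (2j * math.pi) ** 2

c, R0 = complex(1.0, 0.5), 2.0
print(derivative_jump(c, R0), predicted_jump(c, R0))
```

Both quantities are close to $-2\,\mathrm{Re}(c)\,R(0)$, while $R$ itself is continuous at 0, consistent with the case $k = 0$ of (10.6).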

We can deduce immediately, from either (10.4) or (10.5), that $R$ satisfies
the differential equations

$$\begin{aligned}
\sum_{j=0}^{\beta} \frac{B_j}{(2\pi i)^j}\,R^{(j)}(t) &= 0, \qquad t \ge 0+, \\
\sum_{j=0}^{\beta} \frac{\bar B_j}{(2\pi i)^j}\,R^{(j)}(t) &= 0, \qquad t \le 0-,
\end{aligned} \tag{10.7}$$

where the qualifications on the right mean that the equations hold on the
closed half-lines, using the appropriate one-sided derivatives when $t = 0$.
Then

$$\sum_{j,k \ge 0} \frac{B_j \bar B_k}{(2\pi i)^{j+k}}\,R^{(j+k)}(t) = 0, \qquad t \ne 0. \tag{10.8}$$

According to (10.6), this differential equation cannot be satisfied when
$t = 0$, because the first $2\beta$ derivatives of $R$ do not exist at that point.
Even without (10.6), the non-existence of these derivatives is readily
deduced from the fact that $R$ is bounded, whereas no solution of (10.8)
for all $t$ is bounded if $\sum_0^{\beta} B_j z^j$ has no real zero.

THEOREM 10.1 If the stationary (wide sense) process $\{x(t), -\infty < t < \infty\}$
has an absolutely continuous spectral distribution, with density given by
(10.1), and if the zeros of $\sum_0^{\beta} B_j z^j$ are in the upper half-plane, then $R$ satisfies
the differential equations (10.7), and the boundary conditions (10.6). Conversely,
if the covariance function of a stationary (wide sense) process
satisfies (10.7), for some constants $B_0, \cdots, B_\beta$, if $B_0 B_\beta \ne 0$, and if no zero
of $\sum_0^{\beta} B_j z^j$ is real, then, if $m$ is the smallest value of $k$ for which $R^{(k)}(0+) \ne R^{(k)}(0-)$,
it follows that the process has an absolutely continuous
spectral distribution, with density (10.1), for some constants $A_0, \cdots, A_\alpha$,
where $2\alpha = 2\beta - m - 1$.

We have already proved the direct half of this theorem. To prove the
converse, we note that, if $R$ satisfies (10.7), it must be indefinitely differentiable
for $t \ge 0+$. Moreover, as a solution of (10.7), $R$ must be given
by (10.5), where the $z_j$'s are zeros of $\sum_0^{\beta} B_j z^j$ and the $C_j$'s are polynomials
in $t$. Since $R$ is bounded, and since there are no real $z_j$'s, by hypothesis,
only $z_j$'s with positive imaginary part can actually appear in the expression
for $R(t)$. Then $R(t)$ vanishes exponentially at $\infty$. Hence [cf. (3.5)], $F$
is absolutely continuous, with

$$F'(\lambda) = \int_{-\infty}^{\infty} e^{-2\pi i\lambda t}\,R(t)\,dt,$$

and, integrating by parts,

$$\begin{aligned}
F'(\lambda) &= \frac{1}{2\pi i\lambda} \int_{-\infty}^{\infty} e^{-2\pi i\lambda t}\,R'(t)\,dt \\
&= \sum_{j=2}^{n} \frac{R^{(j-1)}(0+) - R^{(j-1)}(0-)}{(2\pi i\lambda)^j} + \frac{1}{(2\pi i\lambda)^n} \int_{-\infty}^{\infty} e^{-2\pi i\lambda t}\,R^{(n)}(t)\,dt, \qquad n \ge 2.
\end{aligned}$$

Then

$$\lambda^k F'(\lambda) = p_{k-m-1}(\lambda) + \frac{1}{(2\pi i)^k} \int_{-\infty}^{\infty} e^{-2\pi i\lambda t}\,R^{(k)}(t)\,dt, \qquad k \ge 0,$$

where $p_k$ is a polynomial of degree $k$, if $k \ge 0$, and vanishes identically if
$k < 0$. It follows that

$$\Big|\sum_0^{\beta} B_j\lambda^j\Big|^2 F'(\lambda) = \sum_{j,k=0}^{\beta} B_j \bar B_k\,\lambda^{j+k} F'(\lambda) = A(\lambda) + \int_{-\infty}^{\infty} e^{-2\pi i\lambda t} \sum_{j,k} \frac{B_j \bar B_k}{(2\pi i)^{j+k}}\,R^{(j+k)}(t)\,dt, \tag{10.9}$$

where $A(\cdot)$ is a polynomial of degree $2\beta - m - 1$. Since (10.7) implies
(10.8), the integral vanishes, and (10.9) reduces to

$$\Big|\sum_0^{\beta} B_j\lambda^j\Big|^2 F'(\lambda) = A(\lambda),$$

and we have now proved the converse half of the theorem. Note that, in
the discussion of the converse, the expression (10.5) for $R(t)$ can contain
only $z_j$'s with positive imaginary part, since $R$ is bounded. If $\sum_0^{\beta} B_j z^j$ has
zeros with negative imaginary parts, they do not appear in (10.5). It then
follows that $R$ satisfies a differential equation of order $< \beta$, so that, in
the representation of $F'$ just obtained, the numerator and denominator
polynomials have common roots.

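We proceed to analyze in more detail the case when $\alpha = 0$, $A_0 = 1$ in
(10.1), corresponding to Case 2 in the discrete parameter discussion of
X §10. In this case

$$F'(\lambda) = \frac{1}{\big|\sum_0^{\beta} B_j\lambda^j\big|^2}, \qquad \beta > 0, \quad B_0 B_\beta \ne 0,$$

and it is convenient to write the spectral representation in the form

$$x(t) = \int_{-\infty}^{\infty} \frac{e^{2\pi i t\lambda}}{\sum_{j=0}^{\beta} B_j\lambda^j}\,dz^*(\lambda), \qquad E\{|dz^*(\lambda)|^2\} = d\lambda.$$

Then

$$\int_{-\infty}^{\infty} \lambda^{2(\beta-1)}\,dF(\lambda) < \infty,$$

so that the first $\beta - 1$ derived processes exist; if the process is separable,
the sample functions have $\beta - 1$ derivatives with

$$x^{(k)}(t) = \int_{-\infty}^{\infty} \frac{(2\pi i\lambda)^k\,e^{2\pi i t\lambda}}{\sum_0^{\beta} B_j\lambda^j}\,dz^*(\lambda), \qquad k \le \beta - 1.$$

Hence formally

$$B_0\,x(t) + \frac{B_1}{2\pi i}\,x'(t) + \cdots + \frac{B_\beta}{(2\pi i)^\beta}\,x^{(\beta)}(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dz^*(\lambda) = z'(t), \tag{10.10}$$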


where (cf. IX §4) the $z(t)$ process is the Fourier transform of the $z^*(\lambda)$
process. That is to say, formally the $x(t)$ sample functions satisfy a
stochastic version of (10.5), a differential equation of order $\beta$, linear with
constant coefficients (which are real if the process is real) whose right-hand
side is the (fictitious) derivative of a process with orthogonal stationary
(wide sense) increments. With some stretching of the mathematics the
fictitious $z'(t)$ here can be said to have the property that

$$E\{z'(s)\,\overline{z'(t)}\} = 0, \qquad s \ne t,$$

corresponding to the orthogonality of the $z(t)$ increments. Thus the
differential equation is the analogue of the difference equation X (10.5)
in the discrete parameter discussion, X §10, Case 2. In order to justify
the symbolism of (10.10) we prove that, for a large class of functions $f^*$,

$$B_0 \int_{-\infty}^{\infty} f^*(t)\,x(t)\,dt + \cdots + \frac{B_{\beta-1}}{(2\pi i)^{\beta-1}} \int_{-\infty}^{\infty} f^*(t)\,x^{(\beta-1)}(t)\,dt + \frac{B_\beta}{(2\pi i)^\beta} \int_{-\infty}^{\infty} f^*(t)\,dx^{(\beta-1)}(t) = \int_{-\infty}^{\infty} f^*(t)\,dz(t). \tag{10.10'}$$

In this equation the terms on the left are ordinary Lebesgue integrals of
sample functions, except for the last one. The last one was defined in
§9 for functions $f^*$ satisfying

$$\int_{-\infty}^{\infty} |f^*(t)|\,dt < \infty, \qquad \int_{-\infty}^{\infty} \lambda^{2\beta}\,|f(\lambda)|^2\,dF(\lambda) < \infty,$$

where

$$f(\lambda) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda} f^*(t)\,dt.$$

The integral on the right was defined in IX §2 for functions $f^*$ satisfying

$$\int_{-\infty}^{\infty} |f^*(t)|^2\,dt < \infty.$$


We shall prove that (10.10') holds with probability 1 for any Baire function
$f^*$ satisfying all these conditions. For example, if $f^*$ is continuous with
a continuous derivative, in some finite closed interval, and vanishes
outside the interval, these conditions will certainly be satisfied. To prove
(10.10') we need only remark that in accordance with the integration rules
already derived [in particular see (9.5)], the left side of (10.10') can be
expressed in the form

$$\int_{-\infty}^{\infty} dz^*(\lambda) \int_{-\infty}^{\infty} \frac{\sum_0^{\beta} B_j\lambda^j}{\sum_0^{\beta} B_j\lambda^j}\,f^*(t)\,e^{2\pi i t\lambda}\,dt = \int_{-\infty}^{\infty} f(\lambda)\,dz^*(\lambda) = \int_{-\infty}^{\infty} f^*(t)\,dz(t),$$

since the $z(t)$ process is the Fourier transform of the $z^*(\lambda)$ process.
Conversely suppose that the first $\beta - 1$ derived processes of a stationary
(wide sense) $x(t)$ process exist and that there is a $z(t)$ process with orthogonal
increments, with $E\{|dz(t)|^2\} = dt$, for which (10.10) holds in the
sense that (10.10') holds. Then we prove that $F$ is absolutely continuous,
with $F'(\lambda) = \big|\sum_0^{\beta} B_j\lambda^j\big|^{-2}$. The principle of the proof is that the (fictitious)
$z'(t)$ process has constant spectral density 1 and that the left side of (10.10),
as the result of a linear operation on the $x(t)$ process with gain $\sum_0^{\beta} B_j\lambda^j$,
has spectral density $\big|\sum_0^{\beta} B_j\lambda^j\big|^2 F'(\lambda)$. Equating the two spectral densities
gives the required result. More rigorously, we observe that, by (10.10'),

$$\int_{-\infty}^{\infty} f^*(s)\Big[B_0\,x(s+t)\,ds + \cdots + \frac{B_\beta}{(2\pi i)^\beta}\,dx^{(\beta-1)}(s+t)\Big] = \int_{-\infty}^{\infty} f^*(s)\,dz(s + t)$$

with probability 1, if $f^*$ satisfies the conditions imposed above. Now the
left side of this equation is the result of a linear operation on the $x(t)$
process with gain $f(\lambda) \sum_0^{\beta} B_j\lambda^j$, so that the process on the left has spectral
distribution function given by

$$\int_{-\infty}^{\lambda} \Big|f(\mu) \sum_0^{\beta} B_j\mu^j\Big|^2\,dF(\mu).$$

The right side of this equation defines a process of moving averages, with
spectral distribution function given by

$$\int_{-\infty}^{\lambda} |f(\mu)|^2\,d\mu.$$

Equating these two distribution functions, we find that any discontinuity
of $F$ must be a zero of $\sum_0^{\beta} B_j\lambda^j$, and that between discontinuities $F$ is
absolutely continuous, with

$$F'(\lambda) = \frac{1}{\big|\sum_0^{\beta} B_j\lambda^j\big|^2}.$$

Since $F'$ is integrable, the denominator does not vanish. Then $F$ is
continuous, and therefore has the stated form, as was to be proved.
From the point of view of prediction theory it is important to know
when the $z'(t)$ in (10.10) is orthogonal to the past of $x(t)$, that is, when

$$E\{[z(t_2) - z(t_1)]\,\overline{x(s)}\} = 0, \qquad s \le t_1 < t_2.$$

This will now be shown to be true if and only if the roots of $\sum_0^{\beta} B_j z^j$ all
have positive imaginary parts. In fact, if $f^*$ in (10.10') is defined by

$$f^*(t) = \begin{cases} 1, & t_1 < t \le t_2, \\ 0 & \text{otherwise}, \end{cases}$$

we find that

$$B_0 \int_{t_1}^{t_2} E\{x(t)\,\overline{x(s)}\}\,dt + \cdots + \frac{B_\beta}{(2\pi i)^\beta}\,E\{[x^{(\beta-1)}(t_2) - x^{(\beta-1)}(t_1)]\,\overline{x(s)}\} = E\{[z(t_2) - z(t_1)]\,\overline{x(s)}\}.$$

The left side of this equation can be put in the form

$$\int_{t_1}^{t_2} dt \int_{-\infty}^{\infty} \frac{\sum_0^{\beta-1} B_j\lambda^j\,e^{2\pi i\lambda(t-s)}}{\big|\sum_0^{\beta} B_j\lambda^j\big|^2}\,d\lambda + \int_{-\infty}^{\infty} \frac{B_\beta\lambda^{\beta-1}\,[e^{2\pi i\lambda(t_2-s)} - e^{2\pi i\lambda(t_1-s)}]}{2\pi i\,\big|\sum_0^{\beta} B_j\lambda^j\big|^2}\,d\lambda = \int_{-\infty}^{\infty} \frac{e^{2\pi i\lambda(t_2-s)} - e^{2\pi i\lambda(t_1-s)}}{2\pi i\lambda\,\sum_0^{\beta} \bar B_j\lambda^j}\,d\lambda.$$

If the roots of $\sum_0^{\beta} B_j\lambda^j$ have positive imaginary parts, those of $\sum_0^{\beta} \bar B_j\lambda^j$ have
negative imaginary parts and the last integral above must vanish, by a
residue argument. The same argument shows that the integral does not
vanish for all $s$, $t_1$, $t_2$ obeying the stated inequalities unless the roots of
$\sum_0^{\beta} B_j z^j$ all have positive imaginary parts.
Thus, if $F'(\lambda) = \big|\sum_0^{\beta} B_j\lambda^j\big|^{-2}$, the sample functions satisfy a stochastic
differential equation (10.10) of a very simple type. Since $|\lambda - \xi| = |\lambda - \bar\xi|$
for real $\lambda$, and since we have not supposed in discussing this differential
equation that the roots of $\sum_0^{\beta} B_j z^j$ have positive imaginary parts, the $B_j$'s,
and thereby the differential equation, can be changed without affecting
the process simply by replacing roots by their conjugates. Thus the
sample functions of the given process satisfy many differential equations
(10.10), but there is only one for which the $z'(t)$ is orthogonal to the past
of the $x(t)$ process in the sense described above, that for which the roots
of $\sum_0^{\beta} B_j z^j$ all lie in the upper half-plane.
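The remark that roots may be replaced by their conjugates without changing the spectral density can be illustrated directly. The sketch below builds polynomial coefficients from a list of roots (hypothetical values), reflects the root lying in the lower half-plane, and compares the moduli of the two polynomials at real arguments.

```python
def poly_from_roots(roots, lead=1.0):
    """Coefficients, in increasing powers of z, of lead * prod (z - r)."""
    coeffs = [complex(lead)]
    for r in roots:
        new = [0j] * (len(coeffs) + 1)
        for k, a in enumerate(coeffs):
            new[k] += -r * a      # contribution of the factor's constant term
            new[k + 1] += a       # contribution of the factor's z term
        coeffs = new
    return coeffs

def evaluate(coeffs, lam):
    """Value of the polynomial with the given coefficients at lam."""
    return sum(a * lam ** k for k, a in enumerate(coeffs))

roots = [1 + 2j, -0.5 - 1j]                       # one root in each half-plane
reflected = [r if r.imag > 0 else r.conjugate() for r in roots]
B = poly_from_roots(roots)
B_reflected = poly_from_roots(reflected)
for lam in (-2.0, 0.0, 0.3, 5.0):
    print(lam, abs(evaluate(B, lam)), abs(evaluate(B_reflected, lam)))
```

The two columns of moduli agree for real `lam`, so both coefficient sets give the same density, although only the reflected set has all roots in the upper half-plane.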
The differential equation (10.10) is of a standard type whose solution
is known,

$$x(t) = \sum_{j=0}^{\beta-1} a_j(t - t_0)\,x^{(j)}(t_0) + \int_{t_0}^{t} g(t - s)\,dz(s), \qquad t_0 < t, \tag{10.11}$$

where $a_0, \cdots, a_{\beta-1}$, $g$ are functions which can be written down explicitly
in terms of the given coefficients $B_0, \cdots, B_\beta$. This solution is of course
usually used for a fixed function $z'(t)$ on the right in (10.10), but the right
side of (10.11) is defined in the present circumstances, and, since the
stochastic integral involved obeys the usual rules, $x(t)$ as defined by
(10.11) must actually satisfy (10.10), that is, (10.10'). If the roots of
$\sum_0^{\beta} B_j z^j$ have positive imaginary parts, we have seen that the $z$ increments
involved in (10.11) are orthogonal to each random variable $x(\tau)$ for $\tau \le t_0$.
Then the integral in (10.11) is orthogonal to $x(\tau)$ for $\tau \le t_0$, so that

$$\sum_{j=0}^{\beta-1} a_j(t - t_0)\,x^{(j)}(t_0)$$

is the projection of $x(t)$ on the past up to time $t_0$. In particular, if $\beta = 1$
the process is a Markov process in the wide sense. In that case (10.11)
becomes

$$x(t) = e^{-b(t-t_0)}\,x(t_0) + \frac{2\pi i}{B_1} \int_{t_0}^{t} e^{-b(t-s)}\,dz(s), \qquad b = 2\pi i\,\frac{B_0}{B_1}, \tag{10.11'}$$

where $B_0 + B_1 z = 0$ defines a root $-b/2\pi i$ with positive imaginary part,
so that $b$ has a positive real part. When $t_0 \to -\infty$, (10.11') becomes

$$x(t) = \frac{2\pi i}{B_1} \int_{-\infty}^{t} e^{-b(t-s)}\,dz(s) \tag{10.12'}$$

and in the same way (10.11) becomes

$$x(t) = \int_{-\infty}^{t} g(t - s)\,dz(s). \tag{10.12}$$

This expression displays the $x(t)$ process as a process of moving averages
of a special type; $x(t)$ depends only on the past of the $z$ process. The
form (10.12') has already been written down in §3 (Example 2).


11. Processes with stationary (wide sense) increments 
Suppose that $\cdots, x_0, x_1, \cdots$ are random variables for which

$$E\{|x_n - x_m|^2\} < \infty$$

for all $m$, $n$ and for which

$$r(m_1, n_1;\ m_2, n_2) = E\{(x_{n_1} - x_{m_1})\,\overline{(x_{n_2} - x_{m_2})}\}$$

is independent of translations of the parameter axis, that is,

$$r(m_1 + h, n_1 + h;\ m_2 + h, n_2 + h) = r(m_1, n_1;\ m_2, n_2)$$

for all integral $h$. Then the $x_n$ process will be said to have stationary
(wide sense) increments. The continuous parameter case is obtained by
allowing the parameter values to be any real numbers, and we shall always
also make the following continuity hypothesis [writing in this case $x(t)$
instead of $x_t$]

$$\lim_{h\to 0} E\{|x(t + h) - x(t)|^2\} = \lim_{h\to 0} r(0, h;\ 0, h) = 0.$$


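In the discrete parameter case the $x_n$ process has stationary (wide sense)
increments if and only if the differences $\cdots, x_1 - x_0, x_2 - x_1, \cdots$
constitute a stationary (wide sense) process, in which event we can write

$x_{n+1} - x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\,dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda),$

where the $y(\lambda)$ process has orthogonal increments. Then

(11.1)  $x_n - x_m = \sum_{j=m}^{n-1} (x_{j+1} - x_j) = \int_{-1/2}^{1/2} \frac{e^{2\pi i n\lambda} - e^{2\pi i m\lambda}}{e^{2\pi i\lambda} - 1}\,dy(\lambda)$

and

(11.2)  $r(m_1, n_1; m_2, n_2) = \int_{-1/2}^{1/2} \frac{(e^{2\pi i n_1\lambda} - e^{2\pi i m_1\lambda})(e^{-2\pi i n_2\lambda} - e^{-2\pi i m_2\lambda})}{|e^{2\pi i\lambda} - 1|^2}\,dF(\lambda).$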


552 STATIONARY PROCESSES—CONTINUOUS PARAMETER Xt 


The principal purpose of the following analysis is to derive the analogues 
of these formulas in the continuous parameter case, and only this case 
will be considered in the remainder of this section. 
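
The integrand in (11.1) is just the closed form of a finite geometric sum of the exponentials $e^{2\pi i j\lambda}$, $m \le j < n$. The following is a minimal numerical sketch of that identity; the function names are illustrative and are not part of the text, and $m < n$ with non-integral $\lambda$ is assumed.

```python
import cmath


def increment_kernel(m, n, lam):
    """Closed form (e^{2 pi i n lam} - e^{2 pi i m lam}) / (e^{2 pi i lam} - 1).

    Assumes m < n and lam not an integer (so the denominator is nonzero).
    """
    num = cmath.exp(2j * cmath.pi * n * lam) - cmath.exp(2j * cmath.pi * m * lam)
    den = cmath.exp(2j * cmath.pi * lam) - 1
    return num / den


def increment_kernel_sum(m, n, lam):
    """The same quantity written as the sum over j = m, ..., n - 1 of e^{2 pi i j lam}."""
    return sum(cmath.exp(2j * cmath.pi * j * lam) for j in range(m, n))
```

Evaluating both functions at the same arguments, for example `increment_kernel(-2, 5, 0.13)` and `increment_kernel_sum(-2, 5, 0.13)`, should give values that agree up to rounding.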
Example 1 If an $x_1(t)$ process is stationary in the wide sense,

$x(t) = \int_0^t x_1(s)\,ds$

defines a process whose increments are stationary in the wide sense.

Example 2 If an $x_1(t)$ process is stationary in the wide sense and if $x_0$
is an arbitrary random variable, $x(t) = x_0 + x_1(t)$ defines a process whose
increments are stationary in the wide sense. If the spectral representation
of the $x_1(t)$ process is

$x_1(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy_1(\lambda), \qquad E\{|dy_1(\lambda)|^2\} = dF_1(\lambda),$

we have

$x(t) - x(s) = \int_{-\infty}^{\infty} (e^{2\pi i t\lambda} - e^{2\pi i s\lambda})\,dy_1(\lambda),$

$r(s_1, t_1; s_2, t_2) = \int_{-\infty}^{\infty} (e^{2\pi i t_1\lambda} - e^{2\pi i s_1\lambda})(e^{-2\pi i t_2\lambda} - e^{-2\pi i s_2\lambda})\,dF_1(\lambda).$


We shall prove that Example 2 approximates the general case in the
sense that every process with stationary (wide sense) increments has the
spectral representation

(11.1')  $x(t) - x(s) = \int_{-\infty}^{\infty} \frac{e^{2\pi i t\lambda} - e^{2\pi i s\lambda}}{2\pi i\lambda}\,(1 + \lambda^2)^{1/2}\,dy(\lambda),$

where the $y(\lambda)$ process has orthogonal increments with

$E\{|dy(\lambda)|^2\} = dH(\lambda)$

and $H$ is a bounded monotone function. It is clear that (11.1') always
defines such a process, with

(11.2')  $r(s_1, t_1; s_2, t_2) = E\{[x(t_1) - x(s_1)]\overline{[x(t_2) - x(s_2)]}\} = \int_{-\infty}^{\infty} \frac{(e^{2\pi i t_1\lambda} - e^{2\pi i s_1\lambda})(e^{-2\pi i t_2\lambda} - e^{-2\pi i s_2\lambda})}{4\pi^2\lambda^2}\,(1 + \lambda^2)\,dH(\lambda).$

[In (11.1') and (11.2') the integrands are defined by continuity when
$\lambda = 0$; the first is $t - s$, the second $(t_1 - s_1)(t_2 - s_2)$.] Formulas (11.1')
and (11.2') are the continuous parameter versions of (11.1) and (11.2).
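
The convention that the integrands are defined by continuity at $\lambda = 0$ can be checked numerically. The sketch below is illustrative only: `kernel` is a hypothetical helper name, and the factor $(1 + \lambda^2)^{1/2}$ is omitted because it equals 1 at $\lambda = 0$.

```python
import cmath


def kernel(t, s, lam):
    """(e^{2 pi i t lam} - e^{2 pi i s lam}) / (2 pi i lam), extended by continuity to lam = 0."""
    if lam == 0.0:
        return complex(t - s)
    num = cmath.exp(2j * cmath.pi * t * lam) - cmath.exp(2j * cmath.pi * s * lam)
    return num / (2j * cmath.pi * lam)


def covariance_kernel(s1, t1, s2, t2, lam):
    """Product kernel appearing in the covariance formula, without the (1 + lam^2) weight."""
    return kernel(t1, s1, lam) * kernel(t2, s2, lam).conjugate()
```

For small nonzero `lam` the first function is close to `t - s` and the second to `(t1 - s1) * (t2 - s2)`, in agreement with the bracketed remark above.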




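If $\int_{-1}^{1} \lambda^{-2}\,dH(\lambda) < \infty$ (and only if this is true), (11.1') can be written in the
form of Example 2,

$x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy_1(\lambda) + x_0,$

$E\{|dy_1(\lambda)|^2\} = \frac{1 + \lambda^2}{4\pi^2\lambda^2}\,dH(\lambda),$

$x_0 = x(0) - y_1(\infty) + y_1(-\infty),$

$y_1(\lambda) = \int_{-\infty}^{\lambda} \frac{(1 + \mu^2)^{1/2}}{2\pi i\mu}\,dy(\mu).$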
In the most general case, according to (11.1), x(t) is very nearly a sum 


of the form n > 
a(t) = 2(0) + tyo + È e y,, 
j=1 


where $y_0, y_1, \cdots$ are mutually orthogonal.

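To derive the representation (11.1') we observe first that since the $x(t)$
process has stationary (wide sense) increments the same is true of the
discrete parameter process ($m$ fixed) $\{x[(n + 1)/m] - x(n/m),\ n = 0, \pm 1, \cdots\}$.
The spectral representation of this process has the form (after a
change of the integration variable)

$x\left(\frac{n+1}{m}\right) - x\left(\frac{n}{m}\right) = \int_{-m/2}^{m/2} e^{2\pi i n\lambda/m}\,dy_m(\lambda),$

where the $y_m(\lambda)$ process has orthogonal increments with

$E\{|dy_m(\lambda)|^2\} = dF_m(\lambda).$

Then, if $t$ is a multiple of $1/m$,

$x(t) - x(0) = \int_{-m/2}^{m/2} \frac{e^{2\pi i t\lambda} - 1}{e^{2\pi i\lambda/m} - 1}\,dy_m(\lambda),$

so that

$E\{|x(t) - x(0)|^2\} = \int_{-m/2}^{m/2} \left|\frac{e^{2\pi i t\lambda} - 1}{e^{2\pi i\lambda/m} - 1}\right|^2 dF_m(\lambda),$

and, more generally, if $s_1, t_1, s_2, t_2$ are multiples of $1/m$,

$E\{[x(t_1) - x(s_1)]\overline{[x(t_2) - x(s_2)]}\} = \int_{-m/2}^{m/2} \frac{(e^{2\pi i t_1\lambda} - e^{2\pi i s_1\lambda})(e^{-2\pi i t_2\lambda} - e^{-2\pi i s_2\lambda})}{|e^{2\pi i\lambda/m} - 1|^2}\,dF_m(\lambda).$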




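Then, if $H_m$ is defined by

$H_m(\lambda) = \int_{-m/2}^{\lambda} \frac{4\pi^2\mu^2}{(1 + \mu^2)\,|e^{2\pi i\mu/m} - 1|^2}\,dF_m(\mu),$

we have

$E\{[x(t_1) - x(s_1)]\overline{[x(t_2) - x(s_2)]}\} = \int_{-m/2}^{m/2} \frac{(e^{2\pi i t_1\lambda} - e^{2\pi i s_1\lambda})(e^{-2\pi i t_2\lambda} - e^{-2\pi i s_2\lambda})}{4\pi^2\lambda^2}\,(1 + \lambda^2)\,dH_m(\lambda).$

It will be convenient to define $H_m(\lambda)$ for $\lambda > m/2$ as $H_m(m/2)$ and for
$\lambda < -m/2$ as $H_m(-m/2) = 0$. It will be proved below that the sequence
$\{H_m\}$ is bounded, and that

$\int_{|\lambda| > \alpha} dH_m(\lambda)$

is uniformly small if $\alpha$ is large. By Helly's theorem there is a convergent
subsequence $\{H_{a_n}\}$. Let $H$ be the limit function. If $s_1, t_1, s_2, t_2$ are
arbitrary, if in the preceding equation these numbers are replaced by the
nearest multiple of $1/m$, and if then $m \to \infty$ along the sequence $\{a_n\}$, the
equation becomes (11.2'). We must still prove the stated properties of
the sequence $\{H_m\}$ however. In the first place, if $t$ is a multiple of $1/m$
we have

(11.3)  $E\{|x(t) - x(0)|^2\} = \int_{-m/2}^{m/2} |e^{2\pi i t\lambda} - 1|^2\,\frac{1 + \lambda^2}{4\pi^2\lambda^2}\,dH_m(\lambda) \ge \int_{-\alpha}^{\alpha} \frac{\sin^2 \pi t\lambda}{\pi^2\lambda^2}\,dH_m(\lambda) \ge \frac{4t^2}{\pi^2} \int_{-\alpha}^{\alpha} dH_m(\lambda)$

for any $\alpha$, if $t$ is so small that $t\alpha \le 1/2$. In the second place, if $N$ is any
positive integer,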


1 x j 2 
rà tlel) -0f 
1 ae N 1 22 } 
ar Æ 
pen, 2rijaj/m _ 1 |2 
3 E le 1 oe Hn) j 
~m; 
m2 


39 | f — sin 7A/m + sin [(N + orm 1+7 


2N sin 7A/m An? dH al) 




so that, if N/m = t > 1/a, 


«(.) no} J fit ah doen 


loa 


1 
a f ao, 
j>a 
Since by our continuity hypothesis the left side of this inequality approaches
0 when $t \to 0$, the right side must go to 0 (uniformly in $m$) when $\alpha \to \infty$.
Moreover, (11.3) and (11.4) taken together imply that the sequence $\{H_m\}$
is bounded. This completes the derivation of the stated properties of
the sequence $\{H_m\}$ and thereby of (11.2').
Before deriving (11.1') we discuss integrals of the form

$\int_{-\infty}^{\infty} f^*(t)\,dx(t),$


x 
(11.4) TE 


where the $x(t)$ process has stationary (wide sense) increments with (11.2')
valid, and $f^*$ is a fixed function. In accordance with the usual procedure
we define the integral first in the obvious way for simple step functions
$f^*$, that is, those which vanish for large $|t|$ and take on only a finite number
of values, each on an interval or on finitely many intervals. For these
functions, if we continue our previous convention relating starred and
unstarred functions, so that

$f(\lambda) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda} f^*(t)\,dt,$

we have, from (11.2'),

(11.5)  $E\left\{\left|\int_{-\infty}^{\infty} f^*(t)\,dx(t)\right|^2\right\} = \int_{-\infty}^{\infty} |f(\lambda)|^2 (1 + \lambda^2)\,dH(\lambda),$

and, more generally,

(11.5')  $E\left\{\int_{-\infty}^{\infty} f^*(t)\,dx(t)\ \overline{\int_{-\infty}^{\infty} g^*(s)\,dx(s)}\right\} = \int_{-\infty}^{\infty} f(\lambda)\overline{g(\lambda)}\,(1 + \lambda^2)\,dH(\lambda).$

We extend the definition to a larger class of functions in such a way that
(11.5') remains true. The functions will be Baire functions $f^*$, integrable
over all finite intervals, for which the limit

$f(\lambda) = \lim_{A \to \infty} \int_{-A}^{A} e^{2\pi i t\lambda} f^*(t)\,dt$

exists for all $\lambda$ and satisfies

(11.6)  $\int_{-\infty}^{\infty} |f(\lambda)|^2 (1 + \lambda^2)\,dH(\lambda) < \infty.$




If distance between random variables is defined as root mean square
distance,

$E^{1/2}\{|x_1 - x_2|^2\},$

and that between functions $f^*$ is also defined as root mean square distance,
with $\lambda$ weighting $(1 + \lambda^2)\,dH(\lambda)$, between the corresponding functions $f$,

$\left[\int_{-\infty}^{\infty} |f_1(\lambda) - f_2(\lambda)|^2 (1 + \lambda^2)\,dH(\lambda)\right]^{1/2},$

the stochastic integral sets up a correspondence between certain functions
of $t$ (simple step functions) and random variables, which preserves linearity
and distance [because of (11.5)]. Then the stochastic integral is defined
by continuity (cf. a similar discussion in IX §2) for all $f$ satisfying (11.6)
and therefore for all $f^*$ of the class considered. Equation (11.5') remains
true since it is true for simple step functions. In particular, if the $x(t)$
process is stationary in the wide sense this stochastic integral reduces to
that defined in §9.
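We can now derive (11.1'). If such a formula holds, and if

$\tilde{y}(\lambda) = \int_0^{\lambda} (1 + \mu^2)^{1/2}\,dy(\mu),$

(11.1') becomes, after a formal differentiation,

$x'(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,d\tilde{y}(\lambda),$

so that we expect a reciprocal formula which is an integrated version of

$\tilde{y}'(\lambda) = \int_{-\infty}^{\infty} e^{-2\pi i t\lambda}\,x'(t)\,dt.$

Actually we shall prove that, if $\tilde{y}(\lambda)$ is defined by

$\tilde{y}(\lambda) = \int_{-\infty}^{\infty} \frac{e^{-2\pi i t\lambda} - 1}{-2\pi i t}\,dx(t)$

[with possible normalizations to make $\tilde{y}(\lambda+) = \tilde{y}(\lambda)$], then the $\tilde{y}(\lambda)$
process has orthogonal increments, and if finally $y(\lambda)$ is defined by

(11.7)  $y(\lambda) = \int_0^{\lambda} \frac{d\tilde{y}(\mu)}{(1 + \mu^2)^{1/2}},$

the $y(\lambda)$ process has orthogonal increments, with $E\{|dy(\lambda)|^2\} = dH(\lambda)$
[assuming that $H(\lambda+) = H(\lambda)$], and (11.1') is true. The above formula
for $\tilde{y}(\lambda)$ leads to

$\tilde{y}(\lambda_2) - \tilde{y}(\lambda_1) = \int_{-\infty}^{\infty} \frac{e^{-2\pi i t\lambda_2} - e^{-2\pi i t\lambda_1}}{-2\pi i t}\,dx(t),$

and this stochastic integral, like the one defining $\tilde{y}(\lambda)$, has already been
discussed. We have, from (11.5) and (11.5'),

$E\{|\tilde{y}(\lambda_2) - \tilde{y}(\lambda_1)|^2\} = \int_{\lambda_1}^{\lambda_2} (1 + \lambda^2)\,dH(\lambda), \qquad \lambda_1 < \lambda_2,$

$E\{[\tilde{y}(\lambda_2) - \tilde{y}(\lambda_1)]\overline{[\tilde{y}(\mu_2) - \tilde{y}(\mu_1)]}\} = 0, \qquad \lambda_1 < \lambda_2 \le \mu_1 < \mu_2,$

if $\lambda_1, \lambda_2, \mu_1, \mu_2$ are points of continuity of $H(\lambda)$. The $\tilde{y}(\lambda)$ process thus
has orthogonal increments for $\lambda$ restricted to continuity points of $H$, and
we define it at the discontinuities to make $\tilde{y}(\lambda+) = \tilde{y}(\lambda)$. Then, if also
$H(\lambda+) = H(\lambda)$, we have $E\{|d\tilde{y}(\lambda)|^2\} = (1 + \lambda^2)\,dH(\lambda)$. Finally $y(\lambda)$ is
defined by (11.7), so that $E\{|dy(\lambda)|^2\} = dH(\lambda)$. The $\tilde{y}(\lambda)$ and $y(\lambda)$ processes
as defined satisfy

(11.8)  $\int_{-\infty}^{\infty} f^*(t)\,dx(t) = \int_{-\infty}^{\infty} f(\lambda)\,d\tilde{y}(\lambda) = \int_{-\infty}^{\infty} f(\lambda)(1 + \lambda^2)^{1/2}\,dy(\lambda)$

if $f(\lambda)$ is 1 on a finite interval whose endpoints are points of continuity of
$H$, and 0 otherwise. In fact, this is precisely the definition of the $\tilde{y}(\lambda)$
process. This equation is therefore true (with probability 1) by the usual
extension argument, for all $f$, $f^*$ for which the integral on the right is
defined. In particular, if $f^*$ is 1 on a finite interval and 0 otherwise,
(11.8) reduces to the desired equation (11.1').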
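
The kernel $(e^{-2\pi i t\lambda_2} - e^{-2\pi i t\lambda_1})/(-2\pi i t)$ used in this derivation is the integral of $e^{-2\pi i t\mu}$ over $\lambda_1 \le \mu \le \lambda_2$, which is why it pairs with the indicator function of that interval. A minimal numerical sketch of this fact follows; the helper names and the midpoint quadrature are illustrative assumptions, not part of the text.

```python
import cmath


def f_star(t, lam1, lam2):
    """(e^{-2 pi i t lam2} - e^{-2 pi i t lam1}) / (-2 pi i t), with the value lam2 - lam1 at t = 0."""
    if t == 0.0:
        return complex(lam2 - lam1)
    num = cmath.exp(-2j * cmath.pi * t * lam2) - cmath.exp(-2j * cmath.pi * t * lam1)
    return num / (-2j * cmath.pi * t)


def f_star_quadrature(t, lam1, lam2, steps=20000):
    """Midpoint-rule approximation of the integral of e^{-2 pi i t mu} over [lam1, lam2]."""
    h = (lam2 - lam1) / steps
    total = sum(cmath.exp(-2j * cmath.pi * t * (lam1 + (k + 0.5) * h)) for k in range(steps))
    return h * total
```

The two functions should agree closely at any fixed `t`, for example at `t = 0.7` with `lam1 = -0.3`, `lam2 = 1.1`.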

Once (11.1') has been proved, it follows that the $y(\lambda)$ in this formula is
essentially uniquely determined. In fact, if in the following we assume
that (11.1') is true, and that $y(\lambda+) = y(\lambda)$, $H(\lambda+) = H(\lambda)$, then (11.8)
is true for $f^*(s)$ defined as 1 on a finite interval and 0 otherwise. It is
then true for all functions $f$, $f^*$ for which the integral on the right is
defined. In particular, if $f(\lambda)$ is 1 on a finite interval, $\frac{1}{2}$ at the endpoints,
and 0 otherwise, (11.8) becomes

(11.9)  $\frac{\tilde{y}(\lambda_2+) + \tilde{y}(\lambda_2-)}{2} - \frac{\tilde{y}(\lambda_1+) + \tilde{y}(\lambda_1-)}{2} = \int_{-\infty}^{\infty} \frac{e^{-2\pi i t\lambda_2} - e^{-2\pi i t\lambda_1}}{-2\pi i t}\,dx(t),$

where $\tilde{y}(\lambda) = \int_0^{\lambda} (1 + \mu^2)^{1/2}\,dy(\mu)$.
Taking the expectation of the square of the modulus of both sides of
(11.9), we find an expression determining $H$,

(11.10)  $\int_{\lambda_1}^{\lambda_2} (1 + \mu^2)\,dH(\mu) = E\left\{\left|\int_{-\infty}^{\infty} \frac{e^{-2\pi i t\lambda_2} - e^{-2\pi i t\lambda_1}}{-2\pi i t}\,dx(t)\right|^2\right\}$

if $\lambda_1$ and $\lambda_2$ are points of continuity of $H$. The right-hand side can also
be expressed in terms of the covariance function $r(s_1, t_1; s_2, t_2)$.

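Example 1 (Continued) If an $x_1(t)$ process is stationary in the wide
sense, its spectral representation leads to

$\int_0^t x_1(s)\,ds = \int_{-\infty}^{\infty} \frac{e^{2\pi i t\lambda} - 1}{2\pi i\lambda}\,dy_1(\lambda), \qquad E\{|dy_1(\lambda)|^2\} = dF_1(\lambda).$

Thus in this case the $x(t)$ process defined by the integral on the left has
stationary (wide sense) increments, and in the spectral representation
(11.1')

$(1 + \lambda^2)^{1/2}\,dy(\lambda) = dy_1(\lambda),$
$(1 + \lambda^2)\,dH(\lambda) = dF_1(\lambda).$

Thus in this case $\int_{-\infty}^{\infty} \lambda^2\,dH(\lambda) < \infty$. Conversely, if an $x(t)$ process has
stationary (wide sense) increments, and if $\int_{-\infty}^{\infty} \lambda^2\,dH(\lambda) < \infty$, (11.1') can
be put in the form

$x(t) - x(s) = \int_{-\infty}^{\infty} \frac{e^{2\pi i t\lambda} - e^{2\pi i s\lambda}}{2\pi i\lambda}\,dy_1(\lambda), \qquad y_1(\lambda) = \int_{-\infty}^{\lambda} (1 + \mu^2)^{1/2}\,dy(\mu).$

Since

$\operatorname{l.i.m.}_{s \to t} \frac{e^{2\pi i t\lambda} - e^{2\pi i s\lambda}}{2\pi i\lambda(t - s)} = e^{2\pi i t\lambda},$

where l.i.m. refers to $\lambda$ weighting $(1 + \lambda^2)\,dH(\lambda)$ on the $\lambda$ axis, it follows
that

$\operatorname{l.i.m.}_{s \to t} \frac{x(t) - x(s)}{t - s} = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\,dy_1(\lambda).$

The $x_1(t)$ process so defined is stationary in the wide sense and will be
called the derived process of the $x(t)$ process. The $x(t)$ process sample
functions are almost all absolutely continuous, if the process is separable,
and $x'(t) = x_1(t)$. (Cf. the discussion of Example 1, §9.) We have thus
proved that a process with stationary (wide sense) increments has the form
of Example 1, that is, the process has a derived process, if and only if
$\int_{-\infty}^{\infty} \lambda^2\,dH(\lambda) < \infty$.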

We have seen in this chapter that a $z(t)$ process with stationary (wide
sense) orthogonal increments satisfies $E\{|dz(t)|^2\} = \sigma^2\,dt$ and acts in many
ways as if the derived $z'(t)$ process existed as a stationary (wide sense)
process with spectral density $\sigma^2$. If the orthogonality hypothesis is
dropped we have the processes under discussion in this section, and the
formal version of (11.1'),

$x'(t) = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}(1 + \lambda^2)^{1/2}\,dy(\lambda),$

suggests that the most general process with stationary (wide sense) increments
acts as if the derived $x'(t)$ process existed as a stationary (wide sense)
process with spectral distribution function given by $\int_0^{\lambda} (1 + \mu^2)\,dH(\mu)$.
[This function of $\lambda$ is not in general a proper spectral distribution function
since $\int_{-\infty}^{\infty} (1 + \mu^2)\,dH(\mu)$ may be infinite.] This somewhat vague statement
has exactly the same kind of verification as in the special case of orthogonal
increments, and we omit the details. We remark, however, that the
statement is true (with no fictions) if the derived process actually exists,
and remark also that the formal differential equation


B 
Pe i! 


can be interpreted for the general $x(t)$ process with stationary (wide sense)
increments in exactly the same way as in §10 where the increments were
orthogonal. There is a stationary (wide sense) $X(t)$ process which is a
solution of this equation, with spectral distribution function given by


XO) = a(t) 


Q + 48) dH(w), 
| | 5 Bw’ P 
0 


In particular, if the $x(t)$ process has orthogonal increments the numerator
in the integrand becomes const. $d\mu$, as was worked out in detail in §10.


CHAPTER XII 


Linear Least Squares Prediction— 
Stationary (Wide Sense) Processes 


1. General principles (discrete parameter) 


Let $\{x_n,\ -\infty < n < \infty\}$ be a stationary (wide sense) process with
spectral distribution function $F$. In the following we shall describe as
"$F$ measure" the measure $\int_E dF(\lambda)$ of $\lambda$ sets $E$. This is Lebesgue-Stieltjes
measure on the $\lambda$ axis, and the corresponding measurable sets and measurable
functions will be called the sets and functions measurable with respect
to $F$. (They include the Borel sets and the Baire functions.) It will be
useful to introduce a systematic notation for certain closed linear manifolds
(see IV §1 and §2) of random variables and of functions measurable
with respect to $F$. In the latter case we use $\lambda$ weighting $dF(\lambda)$, that is,
$F$ measure is used in the integration defining the root mean square distance
between functions which determines the closure property. In the list
below $\mathfrak{M}\{\cdots\}$ means the closed linear manifold generated by the elements
or sets of elements in the braces.

$\mathfrak{M}_n = \mathfrak{M}\{x_j,\ j \le n\}$,    ${}_n\mathfrak{M} = \mathfrak{M}\{e^{2\pi i j\lambda},\ j \le n\}$

$\mathfrak{M}_{-\infty} = \bigcap_n \mathfrak{M}_n$,    ${}_{-\infty}\mathfrak{M} = \bigcap_n {}_n\mathfrak{M}$

$\mathfrak{M}_{\infty} = \mathfrak{M}\{x_j,\ -\infty < j < \infty\} = \mathfrak{M}\{\bigcup_n \mathfrak{M}_n\}$,    ${}_{\infty}\mathfrak{M} = \mathfrak{M}\{e^{2\pi i j\lambda},\ -\infty < j < \infty\} = \mathfrak{M}\{\bigcup_n {}_n\mathfrak{M}\}$.

It is clear that the functions in ${}_n\mathfrak{M}$ consist of those in ${}_m\mathfrak{M}$ multiplied by
$e^{2\pi i(n-m)\lambda}$. The manifold ${}_{\infty}\mathfrak{M}$ consists of the functions $\Phi$ which are
measurable with respect to $F$ and for which $\int_{-1/2}^{1/2} |\Phi(\lambda)|^2\,dF(\lambda) < \infty$. In
fact, these functions $\Phi$ certainly define a closed linear manifold $\mathfrak{M}$, with
${}_{\infty}\mathfrak{M} \subset \mathfrak{M}$ since $e^{2\pi i n\lambda} \in \mathfrak{M}$ for all $n$. Conversely, an elementary Fourier
series argument shows that ${}_{\infty}\mathfrak{M}$ includes every function which takes on
only a finite number of values, each on an interval, and since these
functions are dense in $\mathfrak{M}$ the closure property implies $\mathfrak{M} \subset {}_{\infty}\mathfrak{M}$.
Consider the problem of approximating $x_{n+\nu}$ by linear combinations
$\sum_{j=0}^{N-1} b_j x_{n-j}$, minimizing the mean square error

$E\left\{\left|x_{n+\nu} - \sum_{j=0}^{N-1} b_j x_{n-j}\right|^2\right\}.$

We have proved in IV §3 that if $N$ is fixed there is a minimizing linear
combination, which we denoted by $\hat{E}\{x_{n+\nu} \mid x_{n-N+1}, \cdots, x_n\}$, the projection
of $x_{n+\nu}$ on the linear manifold generated by the random variables
$x_{n-N+1}, \cdots, x_n$. If $N$ is unrestricted we rephrased the problem as
follows: it is desired to minimize $E\{|x_{n+\nu} - \varphi|^2\}$ for $\varphi \in \mathfrak{M}_n$. In this case
also there is a solution, $\hat{E}\{x_{n+\nu} \mid \mathfrak{M}_n\}$, the projection of $x_{n+\nu}$ on $\mathfrak{M}_n$. In
both cases the solution is unique, neglecting zero probabilities (see IV §3).
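
For fixed $N$ the minimizing coefficients are found from the orthogonality of the error to each of $x_n, \cdots, x_{n-N+1}$, which gives a finite linear system in the covariances. The following is a minimal sketch under the assumption of a real process with covariance function $R(k) = E\{x_{m+k}x_m\}$; the function names and the small elimination routine are illustrative and not taken from the text.

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting for a small real system a x = b."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def finite_past_predictor(R, N, nu):
    """Coefficients b_0..b_{N-1} minimizing E|x_{n+nu} - sum_j b_j x_{n-j}|^2, and the minimum.

    R(k) is the covariance of a real wide-sense stationary process (an assumption).
    """
    gram = [[R(abs(j - k)) for k in range(N)] for j in range(N)]
    rhs = [R(nu + j) for j in range(N)]
    b = solve(gram, rhs)
    mse = R(0) - sum(bj * rj for bj, rj in zip(b, rhs))
    return b, mse
```

As a usage example, with the covariance $R(k) = c^{|k|}/(1 - c^2)$, $c = 0.5$, the call `finite_past_predictor(R, 4, 2)` should put essentially all the weight on the most recent observation.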

In particular, if the process is real and Gaussian, and if $E\{x_j\} = 0$ for
all $j$, the solutions to these minimum problems are respectively the conditional
expectations of $x_{n+\nu}$ relative to the sets of random variables
$x_{n-N+1}, \cdots, x_n$ and $\cdots, x_{n-1}, x_n$. It will be remembered that projections
have also been called wide sense conditional expectations.

The random variables $\hat{E}\{x_{n+\nu} \mid x_{n-N+1}, \cdots, x_n\}$ and $\hat{E}\{x_{n+\nu} \mid \cdots,
x_{n-1}, x_n\}$ will be called respectively the linear least squares prediction of
$x_{n+\nu}$ in terms of $x_{n-N+1}, \cdots, x_n$ and in terms of the complete past to $x_n$.

We shall not discuss non-linear least squares prediction except to
remark on the definition and existence of a solution. The problem of
non-linear least squares prediction is that of approximating $x_{n+\nu}$ by
functions $f(x_n, \cdots, x_{n-N+1})$, minimizing

$E\{|x_{n+\nu} - f(x_n, \cdots, x_{n-N+1})|^2\}.$

Here the function $f$ is to be a random variable measurable on the sample
space of $x_{n-N+1}, \cdots, x_n$ and it is supposed that $E\{|f|^2\} < \infty$. The
admissible random variables (for $n$, $N$ fixed) constitute a closed linear
manifold, and there is a unique minimizing $f$, the projection $\varphi$ of $x_{n+\nu}$ on
this linear manifold. This solution is characterized by the fact that it is
in the manifold described and that $x_{n+\nu} - \varphi$ is orthogonal to the manifold.
These two properties characterize the function $E\{x_{n+\nu} \mid x_{n-N+1}, \cdots, x_n\}$.
Hence $\varphi$ is simply the conditional expectation. Thus the random variables

$\hat{E}\{x_{n+\nu} \mid x_{n-N+1}, \cdots, x_n\}, \qquad E\{x_{n+\nu} \mid x_{n-N+1}, \cdots, x_n\}$

are respectively the best linear and best (unrestricted) least squares
prediction. The extension to predictions based on the full past to $x_n$ is
obvious. In particular, if the process is real and Gaussian with $E\{x_j\} = 0$
for all $j$, linear and unrestricted prediction coincide. In the language of
II §3, linear prediction is general prediction in the wide sense.

Note that according to IV, Theorem 7.4, and its strict sense version
VII, Theorem 4.3,

$\operatorname{l.i.m.}_{N \to \infty} \hat{E}\{x_{n+\nu} \mid x_{n-N+1}, \cdots, x_n\} = \hat{E}\{x_{n+\nu} \mid \mathfrak{M}_n\}$

$\lim_{N \to \infty} E\{x_{n+\nu} \mid x_{n-N+1}, \cdots, x_n\} = E\{x_{n+\nu} \mid \cdots, x_{n-1}, x_n\}$

(the second limit holding with probability 1). These limit equations
justify the use of predictions based on part of the past to approximate
those based on the full past.

Throughout the rest of this chapter we shall consider almost exclusively
predictions based on the full past. We shall denote by $\varphi_{n,\nu}$ the linear
least squares prediction of $x_{n+\nu}$ in terms of the past up to $x_n$,

$\varphi_{n,\nu} = \hat{E}\{x_{n+\nu} \mid \mathfrak{M}_n\}.$


2. Linear least squares prediction as polynomial approximation 
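The spectral representation of $x_n$,

(2.1)  $x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\,dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda),$

where the $y(\lambda)$ process has orthogonal increments, will play an essential
role in this section. We have seen in IX §2 that, if $\Phi \in {}_{\infty}\mathfrak{M}$, the stochastic
integral

(2.2)  $\varphi = \int_{-1/2}^{1/2} \Phi(\lambda)\,dy(\lambda)$

defines a random variable with $E\{|\varphi|^2\} < \infty$. For example, if $\Phi = e^{2\pi i n\lambda}$,
$\varphi = x_n$. It follows at once from the discussion in IX §2 that $\varphi \in \mathfrak{M}_{\infty}$,
and conversely that to any $\varphi \in \mathfrak{M}_{\infty}$ corresponds a $\Phi \in {}_{\infty}\mathfrak{M}$ such that (2.2)
is satisfied. The random variable $\varphi$ is determined uniquely by $\Phi$, neglecting
zero probabilities, and the function $\Phi$ is determined uniquely by $\varphi$,
neglecting sets of zero $F$ measure. The manifold $\mathfrak{M}_n$ corresponds to ${}_n\mathfrak{M}$
and $\mathfrak{M}_{-\infty}$ corresponds to ${}_{-\infty}\mathfrak{M}$. Any finite sum $\sum_n \gamma_n x_n$ corresponds to
$\sum_n \gamma_n e^{2\pi i n\lambda}$, with

$E\left\{\left|\sum_n \gamma_n x_n\right|^2\right\} = \int_{-1/2}^{1/2} \left|\sum_n \gamma_n e^{2\pi i n\lambda}\right|^2 dF(\lambda).$

Using this correspondence, the problem of finding $\varphi_{0,\nu}$ can be put in $\lambda$
language as follows. The function $\Phi_\nu \in {}_0\mathfrak{M}$ is to be found which minimizes

$\int_{-1/2}^{1/2} |e^{2\pi i\nu\lambda} - \Phi(\lambda)|^2\,dF(\lambda), \qquad \Phi \in {}_0\mathfrak{M}.$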


—1/2 




In other words, we are to approximate $e^{2\pi i\nu\lambda}$ by linear combinations of
the functions $\{e^{2\pi i n\lambda},\ n \le 0\}$. The equality

$E\left\{\left|x_\nu - \sum_{j=0}^{N-1} b_j x_{-j}\right|^2\right\} = \int_{-1/2}^{1/2} \left|e^{2\pi i\nu\lambda} - \sum_{j=0}^{N-1} b_j e^{-2\pi i j\lambda}\right|^2 dF(\lambda)$

exhibits the complete equivalence of the formulations. The solution $\Phi_\nu$
is the projection of $e^{2\pi i\nu\lambda}$ on ${}_0\mathfrak{M}$ and is uniquely determined, neglecting
sets of $F$ measure 0. In the correspondence (2.2) we have

$\varphi_{0,\nu} = \int_{-1/2}^{1/2} \Phi_\nu(\lambda)\,dy(\lambda).$

More generally, it is clear that we have

(2.3)  $\varphi_{n,\nu} = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\,\Phi_\nu(\lambda)\,dy(\lambda).$

From one point of view the approximation problems discussed here,
one in terms of random variables, the other in terms of the corresponding
$\lambda$ functions, are not only equivalent but even identical. To see this
suppose for simplicity that $F(\frac{1}{2}) = 1$. Define probability as the measure
defined by the function $F$ on the $\lambda$ axis. Then the sequence
$\{e^{2\pi i n\lambda},\ -\infty < n < \infty\}$ is a sequence of random variables which is
stationary in the wide sense, with spectral distribution function $F$. The
prediction problem for the stochastic process defined in this way is
precisely the prediction problem for the $x_n$ process as described above,
when translated into $\lambda$ language.

The prediction problem in $\lambda$ language is closely related to classical
polynomial approximation problems. In fact, consider the problem of
minimizing $|P(z) - z^{-\nu}|$ for fixed $\nu > 0$ when $P$ runs through all polynomials
of degree $N - 1$.* If "minimizing" is taken to mean "in the mean
square sense on $|z| = 1$ with arbitrary weighting," the problem is to
minimize

$\int_{-1/2}^{1/2} |P(e^{-2\pi i\lambda}) - e^{2\pi i\nu\lambda}|^2\,dF(\lambda)$

for given $\nu$ and $F$; if $P(z) = \sum_{j=0}^{N-1} b_j z^j$, this integral becomes

$\int_{-1/2}^{1/2} \left|e^{2\pi i\nu\lambda} - \sum_{j=0}^{N-1} b_j e^{-2\pi i j\lambda}\right|^2 dF(\lambda) = E\left\{\left|x_\nu - \sum_{j=0}^{N-1} b_j x_{-j}\right|^2\right\}.$

Thus this polynomial approximation problem is identical with what we have
called above the prediction problem using only a finite part of the past.
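
The identity between the mean square error in terms of the random variables and the weighted integral in $\lambda$ language can be checked numerically in a simple case. The sketch below assumes, purely for illustration, a real process with spectral density $1/|e^{2\pi i\lambda} - c|^2$, $|c| < 1$, whose covariance function is $c^{|k|}/(1 - c^2)$; the function names and the Riemann-sum quadrature are not part of the text.

```python
import cmath


def spectral_error(b, nu, c, steps=4096):
    """Riemann sum for the integral of |e^{2 pi i nu lam} - sum_j b_j e^{-2 pi i j lam}|^2 F'(lam)."""
    total = 0.0
    for k in range(steps):
        lam = -0.5 + (k + 0.5) / steps
        z = cmath.exp(2j * cmath.pi * lam)
        density = 1.0 / abs(z - c) ** 2          # assumed spectral density
        approx = sum(bj * z ** (-j) for j, bj in enumerate(b))
        total += abs(z ** nu - approx) ** 2 * density
    return total / steps


def covariance_error(b, nu, c):
    """E|x_nu - sum_j b_j x_{-j}|^2 computed from the covariance c^{|k|} / (1 - c^2), real b_j."""
    R = lambda k: c ** abs(k) / (1 - c * c)
    cross = sum(bj * R(nu + j) for j, bj in enumerate(b))
    quad = sum(bj * bk * R(j - k) for j, bj in enumerate(b) for k, bk in enumerate(b))
    return R(0) - 2 * cross + quad
```

For any real coefficient list `b` the two functions should return the same number up to quadrature and rounding error.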




The function $\Phi_\nu$ will be called the prediction function for lag $\nu$. If
$\sigma_\nu^2$ is the mean square error of the prediction,

(2.4)  $\sigma_\nu^2 = E\{|x_{n+\nu} - \varphi_{n,\nu}|^2\} = \int_{-1/2}^{1/2} |e^{2\pi i\nu\lambda} - \Phi_\nu(\lambda)|^2\,dF(\lambda).$

Since $\sigma_\nu^2$ is the lower limit of the error in approximating $x_{n+\nu}$ by linear
combinations of the random variables $\cdots, x_{n-1}, x_n$, whereas $\sigma_{\nu+1}^2$ is the
lower limit of the error in approximating $x_{n+\nu}$ by linear combinations of
the random variables $\cdots, x_{n-2}, x_{n-1}$, we must have $\sigma_\nu^2 \le \sigma_{\nu+1}^2$. Hence

$0 \le \sigma_1^2 \le \sigma_2^2 \le \cdots.$

The process is called regular if $\sigma_1^2 > 0$, non-regular or deterministic if
$\sigma_1^2 = 0$. In the regular case, $x_{n+1}$ does not lie in $\mathfrak{M}_n$ so that the sequence
of linear manifolds $\cdots, \mathfrak{M}_0, \mathfrak{M}_1, \cdots$ is (strictly) increasing. In the non-regular
case $x_{n+1}$ lies in $\mathfrak{M}_n$ so that these linear manifolds are all the same;
each is $\mathfrak{M}_{\infty}$, and all the $\sigma_\nu$'s vanish. The $\sigma_\nu$'s may not be strictly increasing
in the regular case. In fact, if the $x_n$'s form an orthonormal set, $1 = \sigma_1^2
= \sigma_2^2 = \cdots$ (see §3). The past completely determines the future in the
non-regular case. This may also be true even in the regular case, but in
the regular case the future cannot be determined completely by a linear
operation on the past.

We shall solve the prediction problem by finding $\varphi_{n,\nu}$ and $\Phi_\nu$ explicitly.
We observe that formally, if $\Phi_\nu$ is given by

$\Phi_\nu(\lambda) = \sum_{j=0}^{\infty} b_j e^{-2\pi i j\lambda},$

$\varphi_{n,\nu}$ is given by

$\varphi_{n,\nu} = \sum_{j=0}^{\infty} b_j x_{n-j},$

which exhibits $\varphi_{n,\nu}$ as a linear operation on the past. It may not be
possible to express $\Phi_\nu$ as the sum of a nicely convergent series of this
form, however, and $\varphi_{n,\nu}$ may have to be expressed in some more complicated
way also.


3. Solution of the prediction problem in simple cases (discrete parameter)
As a trivial example of the prediction problem we are studying in this
chapter, suppose that the $x_n$'s are mutually orthogonal, so that

$dF(\lambda) = E\{|x_0|^2\}\,d\lambda.$

Then $x_{n+\nu}$ is orthogonal to $\mathfrak{M}_n$ so that $\varphi_{n,\nu} = 0$ and $\Phi_\nu(\lambda) \equiv 0$. Note
that we have not assumed that $E\{x_j\} = 0$ in this discussion.




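Next suppose that the spectral distribution function $F$ is absolutely
continuous, with

(3.1)  $F'(\lambda) = \frac{1}{\left|\sum_{j=0}^{\beta} B_j e^{2\pi i j\lambda}\right|^2}, \qquad B_0 B_\beta \ne 0.$

The polynomial $\sum_{j=0}^{\beta} B_j z^j$ cannot vanish when $|z| = 1$, since $F'$ is integrable.
At the expense of changing the $B_j$'s, without changing the modulus of the
polynomial on $|z| = 1$, we can suppose that all its roots have modulus
$< 1$ (see X §10). Then (see X §10, Case 2) $x_{n+1}$ can be written in the form

(3.2)  $x_{n+1} = -\frac{1}{B_\beta}(B_{\beta-1}x_n + \cdots + B_0 x_{n-\beta+1}) + \frac{\xi_{n+1}}{B_\beta},$

where the $\xi_j$'s form an orthonormal set, and each $\xi_{n+1}$ is orthogonal to
the random variables $\cdots, x_{n-1}, x_n$, that is, $\xi_{n+1}$ is orthogonal to $\mathfrak{M}_n$.
The function

(3.3)  $\varphi_{n,1} = -\frac{1}{B_\beta}(B_{\beta-1}x_n + \cdots + B_0 x_{n-\beta+1})$

is then the prediction of $x_{n+1}$ for lag 1, since as so defined $\varphi_{n,1} \in \mathfrak{M}_n$ and
$x_{n+1} - \varphi_{n,1} = \xi_{n+1}/B_\beta$ is orthogonal to $\mathfrak{M}_n$. The prediction function $\Phi_1$
is given by

$\Phi_1(\lambda) = -\frac{1}{B_\beta}(B_{\beta-1} + B_{\beta-2}e^{-2\pi i\lambda} + \cdots + B_0 e^{-2\pi i(\beta-1)\lambda}).$

The mean square prediction error (lag 1) is

$\sigma_1^2 = E\left\{\left|\frac{\xi_{n+1}}{B_\beta}\right|^2\right\} = \frac{1}{|B_\beta|^2}.$

To find the prediction for lag greater than 1, (3.2) can be iterated. For
example,

$x_{n+2} = -\frac{1}{B_\beta}[B_{\beta-1}x_{n+1} + \cdots + B_0 x_{n-\beta+2}] + \frac{\xi_{n+2}}{B_\beta}$

$\qquad = \sum_{j=0}^{\beta-2} \frac{B_{\beta-1}B_{\beta-j-1} - B_\beta B_{\beta-j-2}}{B_\beta^2}\,x_{n-j} + \frac{B_{\beta-1}B_0}{B_\beta^2}\,x_{n-\beta+1} - \frac{B_{\beta-1}}{B_\beta^2}\,\xi_{n+1} + \frac{\xi_{n+2}}{B_\beta}.$

This equation shows that $\varphi_{n,2}$ is given by the terms on the right in
$x_n, \cdots, x_{n-\beta+1}$, and

$\sigma_2^2 = E\left\{\left|\frac{\xi_{n+2}}{B_\beta} - \frac{B_{\beta-1}}{B_\beta^2}\,\xi_{n+1}\right|^2\right\} = \frac{1}{|B_\beta|^2} + \frac{|B_{\beta-1}|^2}{|B_\beta|^4}.$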
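
The lag 2 coefficients can be obtained either from the closed form above or by substituting the lag 1 prediction for $x_{n+1}$ in the difference equation. The following minimal sketch compares the two routes for real coefficients $B_0, \cdots, B_\beta$ with $\beta \ge 2$; the function names and the list layout `B = [B_0, ..., B_beta]` are illustrative assumptions.

```python
def lag1_coeffs(B):
    """Coefficients a_0..a_{beta-1} with lag 1 prediction sum_j a_j x_{n-j}; B = [B_0, ..., B_beta]."""
    beta = len(B) - 1
    return [-B[beta - 1 - j] / B[beta] for j in range(beta)]


def lag2_coeffs_by_iteration(B):
    """Replace x_{n+1} in the difference equation by its lag 1 prediction."""
    a = lag1_coeffs(B)
    beta = len(a)
    out = [a[0] * a[j] for j in range(beta)]   # contribution of the predicted x_{n+1}
    for j in range(1, beta):
        out[j - 1] += a[j]                     # direct terms in x_n, ..., x_{n-beta+2}
    return out


def lag2_coeffs_closed_form(B):
    """Coefficients of x_n, ..., x_{n-beta+1} as written in the closed-form expression."""
    beta = len(B) - 1
    out = [(B[beta - 1] * B[beta - j - 1] - B[beta] * B[beta - j - 2]) / B[beta] ** 2
           for j in range(beta - 1)]
    out.append(B[beta - 1] * B[0] / B[beta] ** 2)
    return out


def lag2_error(B):
    """Mean square error for lag 2: 1/|B_beta|^2 + |B_{beta-1}|^2 / |B_beta|^4."""
    beta = len(B) - 1
    return 1 / abs(B[beta]) ** 2 + abs(B[beta - 1]) ** 2 / abs(B[beta]) ** 4
```

The comparison is purely algebraic, so it does not depend on whether the chosen `B` satisfies the root condition on the polynomial.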




The case $\beta = 1$ (Markov process in the wide sense) has special interest.
In this case (see X §3, Example 2) the formulas are particularly simple:

(3.4)
$F'(\lambda) = \frac{1}{|B_0 + B_1 e^{2\pi i\lambda}|^2},$

$R(n) = \frac{c^n}{(1 - |c|^2)|B_1|^2}, \quad n \ge 0, \qquad c = -\frac{B_0}{B_1},$

$\varphi_{n,\nu} = c^\nu x_n, \qquad \Phi_\nu(\lambda) = c^\nu,$

$\sigma_\nu^2 = \frac{1}{|B_1|^2}\,(1 + |c|^2 + \cdots + |c|^{2\nu-2}).$

Note that $\varphi_{n-\nu,\nu}$, the prediction of $x_n$ with lag $\nu$, goes to 0 when $\nu \to \infty$.
In other words, the predicted value of $x_n$ based on the remote past is 0,
and the corresponding mean square error is $R(0) = E\{|x_n|^2\}$.
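
The statements about the remote past can be illustrated numerically: the prediction coefficient $c^\nu$ tends to 0 and the lag $\nu$ error increases to the variance $R(0)$. The sketch below assumes $|c| = |B_0/B_1| < 1$; the function names are hypothetical.

```python
def markov_prediction(B0, B1, nu):
    """Return (c**nu, sigma_nu^2) for the wide-sense Markov case, with c = -B0/B1, |c| < 1."""
    c = -B0 / B1
    coeff = c ** nu
    mse = sum(abs(c) ** (2 * k) for k in range(nu)) / abs(B1) ** 2
    return coeff, mse


def markov_variance(B0, B1):
    """R(0) = 1 / ((1 - |c|^2) |B1|^2)."""
    c = -B0 / B1
    return 1.0 / ((1 - abs(c) ** 2) * abs(B1) ** 2)
```

For instance, with `B0 = -0.5` and `B1 = 1.0` the lag 3 error is `1 + 0.25 + 0.0625`, and for large lags the error is numerically indistinguishable from `markov_variance(-0.5, 1.0)`.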

The processes with spectral density (3.1) are regular and have the
property that the prediction $\varphi_{n,1}$ depends only on a finite number of the
past $x_j$'s. Conversely, a regular process with this property must satisfy
a difference equation of the form (3.2), where $B_0 B_\beta \ne 0$, $E\{|\xi_{n+1}|^2\} = 1$,
and $\xi_{n+1}$ is orthogonal to $\mathfrak{M}_n$. According to X §10 the $x_n$ process must
then have an absolutely continuous spectral distribution with density
(3.1), and the roots of $\sum_{j=0}^{\beta} B_j z^j$ must have modulus $< 1$.

Obviously a process is deterministic and has the property that $\varphi_{n,1}$
depends only on a finite number of past $x_j$'s if and only if it satisfies a
homogeneous difference equation

(3.5)  $B_\beta x_{n+1} + \cdots + B_0 x_{n-\beta+1} = 0, \qquad B_0 B_\beta \ne 0,$

where we have supposed that $\varphi_{n,1}$ depends on $\beta$ past $x_j$'s. A necessary
and sufficient condition for (3.5) is that $F$ be constant except for a finite
number of jumps. In fact,

(3.6)  $E\{|B_\beta x_{n+1} + \cdots + B_0 x_{n-\beta+1}|^2\} = \int_{-1/2}^{1/2} |B_\beta e^{2\pi i\beta\lambda} + \cdots + B_0|^2\,dF(\lambda).$

If (3.5) is true, the integral in (3.6) vanishes, and this is impossible unless
$F(\lambda)$ remains constant except at the ($\le \beta$) zeros of the integrand. Conversely,
if $F$ is constant except for $\beta$ jumps, $B_0, \cdots, B_\beta$ can be chosen to
make the integrand in (3.6) vanish at each jump. Then the integral will
vanish and (3.5) will be true.
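
The choice of $B_0, \cdots, B_\beta$ in the converse can be made concrete: take the polynomial whose roots are the points $e^{2\pi i\lambda_k}$ at the jumps $\lambda_k$ of $F$. The sketch below checks the resulting homogeneous difference equation on a sequence of the form $\sum_k a_k e^{2\pi i n\lambda_k}$ with fixed amplitudes $a_k$, which is an illustrative stand-in for a sample sequence of such a process; all names are hypothetical.

```python
import cmath


def annihilating_coeffs(jump_points):
    """Coefficients B_0..B_beta of prod_k (z - e^{2 pi i lambda_k}), lowest degree first."""
    coeffs = [1 + 0j]
    for lam in jump_points:
        root = cmath.exp(2j * cmath.pi * lam)
        new = [0j] * (len(coeffs) + 1)
        for j, cj in enumerate(coeffs):
            new[j + 1] += cj          # z * (c_j z^j)
            new[j] -= root * cj       # -root * (c_j z^j)
        coeffs = new
    return coeffs


def harmonic_sample(amplitudes, jump_points, n):
    """x_n = sum_k a_k e^{2 pi i n lambda_k} for fixed (assumed) amplitudes a_k."""
    return sum(a * cmath.exp(2j * cmath.pi * n * lam) for a, lam in zip(amplitudes, jump_points))


def recurrence_residual(B, amplitudes, jump_points, n):
    """B_beta x_{n+1} + ... + B_0 x_{n-beta+1}, which should vanish."""
    beta = len(B) - 1
    return sum(B[j] * harmonic_sample(amplitudes, jump_points, n - beta + 1 + j)
               for j in range(beta + 1))
```

With three jump points the polynomial has degree 3, its constant term has modulus 1, and the residual is zero up to rounding for every `n`.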




The following example exhibits the result of combining a regular and a
deterministic process. Let $\{u_n,\ -\infty < n < \infty\}$ be a stationary (wide
sense) Markov (wide sense) process with spectral density given by

$F_u'(\lambda) = \frac{1}{|e^{2\pi i\lambda} - c|^2}, \qquad |c| < 1,$

so that, by (3.4) with $B_1 = 1$,

$R_u(n) = \frac{c^n}{1 - |c|^2}, \quad n \ge 0,$

$\varphi_{u;n,\nu} = c^\nu u_n,$

$\sigma_{u,\nu}^2 = 1 + |c|^2 + \cdots + |c|^{2\nu-2}.$

Let $v$ be orthogonal to all the $u_n$'s, with $E\{|v|^2\} = 1$, and define a $v_n$
process by setting $v_n = v$ for all $n$. The spectral distribution of the $v_n$
process is concentrated at the origin,

$F_v(\lambda) = 0, \quad \lambda < 0$
$\qquad\ \ = 1, \quad \lambda \ge 0,$

and the prediction of $v_n$ with any lag is of course $v_n$. Finally set
$x_n = u_n + v_n$, so that the spectral distribution function $F$ of the $x_n$ process
is $F_u + F_v$. To evaluate the prediction $\varphi_{n,1}$ of $x_{n+1}$ with lag 1 note that

$\operatorname{l.i.m.}_{m \to \infty} \frac{u_n + u_{n-1} + \cdots + u_{n-m}}{m + 1} = 0.$

This can be derived either from the law of large numbers for stationary
(wide sense) processes (X Theorem 6.1) or by an explicit evaluation of the
expectation of the square of the modulus of the above ratio. The random
variable

(3.7)  $\varphi = c x_n + (1 - c)\,\operatorname{l.i.m.}_{m \to \infty} \frac{x_n + \cdots + x_{n-m}}{m + 1} = c u_n + v$

lies in $\mathfrak{M}_n$, and $x_{n+1} - \varphi = u_{n+1} - c u_n$ is orthogonal to $\mathfrak{M}_n$. Hence
$\varphi = \varphi_{n,1}$, and we have proved that the prediction $\varphi_{n,1}$ is the sum of the
individual predictions $c u_n$ and $v_n$ relative to the $u_n$ and $v_n$ processes. If
the $x_j$'s on the left in (3.7) are replaced by $u_j$'s, the right side becomes
$c u_n$, the $u$ prediction of $u_{n+1}$; if the $x_j$'s are replaced by $v_j$'s, the right side
becomes $v = v_{n+1}$, the prediction of $v_{n+1}$. Hence the calculation of the
prediction can be made in the same way for the $x_n$, $u_n$, and $v_n$ processes.




In other words, the prediction function can be chosen to be the same for
all three; explicitly

$\Phi_1(\lambda) = c + (1 - c)\,\operatorname{l.i.m.}_{m \to \infty} \frac{1 + e^{-2\pi i\lambda} + \cdots + e^{-2\pi i m\lambda}}{m + 1},$

where the $\lambda$ weighting for l.i.m. is $dF(\lambda)$, and therefore

$\Phi_1(\lambda) = c, \quad \lambda \ne 0$
$\qquad\ \ = 1, \quad \lambda = 0.$

We have already observed in (3.4) that $\Phi_1(\lambda)$ for the $u_n$ process should be
identically $c$. There is no contradiction here since, in terms of $dF_u(\lambda)$
weighting, a change of $\Phi_1$ at one point makes no difference (that is, each
point has $F_u$ measure 0). Of course, the linear operation (3.7) is needlessly
complicated for the prediction of the $u_n$ process, corresponding to
the fact that $\Phi_1$ is needlessly complicated for the prediction function of
the $u_n$ process. The point is that, as defined, $\Phi_1$ is the prediction function
for the $x_n$, $u_n$, and $v_n$ processes, and it will be seen in §4 that this is
characteristic of the general case. Finally note that the prediction error
with lag $\nu$ of the $x_n$ process is

$\sigma_\nu^2 = 1 + |c|^2 + \cdots + |c|^{2\nu-2},$

the same as that of the $u_n$ process, but when $\nu \to \infty$ this approaches

$\frac{1}{1 - |c|^2} < 1 + \frac{1}{1 - |c|^2} = E\{|x_n|^2\}.$
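
The two values of $\Phi_1$ come from the behavior of the averages $(1 + e^{-2\pi i\lambda} + \cdots + e^{-2\pi i m\lambda})/(m + 1)$, which equal 1 at $\lambda = 0$ and tend to 0 at every other $\lambda$ in $[-\frac{1}{2}, \frac{1}{2}]$. The following minimal sketch evaluates the finite-$m$ approximation pointwise; the function names are illustrative only.

```python
import cmath


def cesaro_mean(lam, m):
    """(1 + e^{-2 pi i lam} + ... + e^{-2 pi i m lam}) / (m + 1)."""
    return sum(cmath.exp(-2j * cmath.pi * j * lam) for j in range(m + 1)) / (m + 1)


def phi1_approx(lam, c, m):
    """Finite-m version of the prediction function c + (1 - c) * (average of exponentials)."""
    return c + (1 - c) * cesaro_mean(lam, m)
```

At `lam = 0.0` the approximation equals 1 for every `m`, while at a fixed nonzero `lam` it approaches `c` as `m` grows.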


As a last example suppose that $F$ is an absolutely continuous spectral
distribution function with density $F'$ of the form

$F'(\lambda) = \left|\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}\right|^2, \qquad c_0 \ne 0,$


where the series converges uniformly, and suppose that > cz” converges 
o 

uniformly for |z| = 1— £ for some e> 0 and does not vanish there. 

For example, we have seen in X $10 that, if F” is a rational function of 

e*, it can be written in the form 


Sag" 
F'(A) =|4—|+ — 49BpA,By #0 
$a 




where z = e?"® and where the denominator polynomial does not vanish 
for |z| = 1. Then the above hypotheses are satisfied in this case. Under 
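the above hypotheses we can define the prediction function $\Phi_\nu$ by

$\Phi_\nu(\lambda) = e^{2\pi i\nu\lambda}\,\frac{\sum_{j=\nu}^{\infty} c_j e^{-2\pi i j\lambda}}{\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}}.$

In fact, as so defined, $\Phi_\nu \in {}_0\mathfrak{M}$ since we can put ($z = e^{-2\pi i\lambda}$) the reciprocal
$1/\sum_{j=0}^{\infty} c_j z^j$ in the form of a power series in $z$,
where the series converges uniformly, and $e^{2\pi i\nu\lambda} - \Phi_\nu$ is orthogonal to
${}_0\mathfrak{M}$ [$\lambda$ weighting $dF(\lambda)$] because

$\int_{-1/2}^{1/2} [e^{2\pi i\nu\lambda} - \Phi_\nu(\lambda)]\,e^{2\pi i m\lambda}\,dF(\lambda) = \int_{-1/2}^{1/2} e^{2\pi i(\nu+m)\lambda} \sum_{j=0}^{\nu-1} c_j e^{-2\pi i j\lambda}\ \overline{\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}}\,d\lambda = 0$

if $m \ge 0$.

Thus the prediction problem is solved once $F'$ can be put in the form
of the preceding paragraph. Much of the next section will be devoted to
an examination of the implications of such a representation of $F'$ with
weaker convergence hypotheses.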


4. General solution of the prediction problem (discrete parameter) 

THEOREM 4.1 If $\{\xi_n,\ -\infty < n < \infty\}$ is an orthonormal sequence of
random variables, if $\sum_{0}^{\infty} |c_n|^2 < \infty$, $c_0 \ne 0$, and if $x_n$ is defined by

(4.1)  $x_n = \sum_{j=0}^{\infty} c_j \xi_{n-j},$

then the $x_n$ process is stationary in the wide sense and has spectral density
$\left|\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda}\right|^2$. The process is regular and the mean square prediction error
$\sigma_\nu^2$ for lag $\nu$ satisfies the inequality

(4.2)  $\sigma_\nu^2 \ge |c_0|^2 + \cdots + |c_{\nu-1}|^2.$

More generally, (4.2) remains true if instead of supposing that $x_n$ is given
by (4.1) it is only supposed that the $x_n$ process has spectral distribution
function $F$ (not necessarily absolutely continuous) with

(4.3)  $F'(\lambda) = \left|\sum_{n=0}^{\infty} c_n e^{-2\pi i n\lambda}\right|^2, \qquad c_0 \ne 0, \qquad \sum_{n=0}^{\infty} |c_n|^2 < \infty,$

for almost all $\lambda$ (Lebesgue measure). In either case there cannot be equality
in (4.2) for any $\nu$ if $\sum_{0}^{\infty} c_n z^n$ has any zeros in $|z| < 1$.

We observe that the series for $x_n$ in (4.1) converges in the mean, by IV,
Theorem 4.1. According to X §8 an $x_n$ process can be written in the
form (4.1) with the stated conditions on the $c_n$'s if and only if the spectral
distribution is absolutely continuous and $F'$ is given by (4.3).
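
The role of zeros inside the unit circle can be seen in the simplest case $x_n = \xi_n + 2\xi_{n-1}$, where $c_0 = 1$, $c_1 = 2$ and $c_0 + c_1 z$ vanishes at $z = -\frac{1}{2}$. The sketch below estimates the lag 1 error from a long finite past using the Levinson-Durbin recursion, which is a standard numerical device swapped in here for illustration and is not the method of the text; the covariances $R(0) = 5$, $R(1) = 2$, $R(k) = 0$ for $k \ge 2$ follow from (4.1) with real coefficients.

```python
def one_step_error(R, N):
    """Mean square error of the best linear lag 1 predictor from N past values.

    Levinson-Durbin recursion for a real covariance function R(k); an illustrative choice.
    """
    err = R(0)
    a = []  # a[j] multiplies x_{n-1-j} in the current predictor of x_n
    for k in range(1, N + 1):
        acc = R(k) - sum(a[j] * R(k - 1 - j) for j in range(k - 1))
        refl = acc / err
        a = [a[j] - refl * a[k - 2 - j] for j in range(k - 1)] + [refl]
        err *= (1 - refl * refl)
    return err


def ma1_covariance(c0, c1):
    """Covariance function of x_n = c0 * xi_n + c1 * xi_{n-1} with real c0, c1."""
    def R(k):
        k = abs(k)
        if k == 0:
            return c0 * c0 + c1 * c1
        if k == 1:
            return c0 * c1
        return 0.0
    return R
```

With `c0 = 1`, `c1 = 2` the error from 30 past values is numerically 4, strictly larger than $|c_0|^2 = 1$, as the last statement of the theorem requires; the same covariance function arises from `c0 = 2`, `c1 = 1`, whose polynomial has no zero in $|z| < 1$ and for which $|c_0|^2 = 4$.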

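We prove (4.2) for $\nu = 1$; the proof in the general case is similar. The
mean square prediction error of any finite prediction sum for lag 1 is
easily calculated, and we find

$E\left\{\left|x_1 - \sum_{j=1}^{N} b_j x_{1-j}\right|^2\right\} = \int_{-1/2}^{1/2} \left|e^{2\pi i\lambda} - \sum_{j=1}^{N} b_j e^{2\pi i(1-j)\lambda}\right|^2 dF(\lambda)$

$\qquad \ge \int_{-1/2}^{1/2} \left|1 - \sum_{j=1}^{N} b_j e^{-2\pi i j\lambda}\right|^2 \left|\sum_{n=0}^{\infty} c_n e^{-2\pi i n\lambda}\right|^2 d\lambda$

$\qquad = \int_{-1/2}^{1/2} |c_0 + \text{const.}\,e^{-2\pi i\lambda} + \text{const.}\,e^{-4\pi i\lambda} + \cdots|^2\,d\lambda$

$\qquad = |c_0|^2 + \cdots \ge |c_0|^2.$

Thus $\sigma_1^2 \ge |c_0|^2$. Now suppose $\sum_{0}^{\infty} c_j z_0^{\,j} = 0$, $|z_0| < 1$. Then we shall
prove that $\sigma_1^2 > |c_0|^2$. In the first place, since $\sum_{0}^{\infty} c_j z_0^{\,j} = 0$,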


: 
Ge =E-A)T" 4 =—2 
0 


oMs8 


so that, going to |z| = 1, 


oMs 


A 
Gena = (eum %) S č Eri 
0 


where both series converge in the mean. Moreover, 




æ 


tee 5 geri 


E 
ert _ 3, 


EJ 
| >, cen jh 
0 


2 = 
=| eer 
0 


2 o 
= | i Beles Sy Crees za y Epes 
0 0 
E 
r C 
X Seremi 2 (at co, 
0 Zo 


o 
where the last series converges in the mean, so that > |c"? < 0. Now 
0 
© 
= ” 
Ëa = 2 CF Ena 
j=0 


defines a process with the same spectral density 
2 


2 


a 

ie 

oes 
0 


= f 
» cfes 
0 


as the a, process. Hence it has the same prediction error, and by the 
inequality already proved i 
o> lert = Sh > lel 
0! 

This finishes the proof of the lemma. 
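
The strict inequality can be seen numerically in the simplest case. The following Python sketch is an illustration only and is not part of the original text: the function name `finite_past_error`, the coefficient pair $(c_0, c_1) = (0.5, 1)$, and the number of past variables are assumptions chosen for the example. It solves the normal equations for the best linear predictor of $x_n$ from finitely many past values when $x_n = c_0\xi_n + c_1\xi_{n-1}$ with real coefficients. Here $c_0 + c_1 z$ vanishes at $z = -0.5$, so the theorem predicts an error strictly larger than $|c_0|^2 = 0.25$.

```python
import numpy as np

def finite_past_error(c, n_past):
    """One-step mean square error of the best linear predictor of x_n from
    x_{n-1}, ..., x_{n-n_past}, where x_n = sum_j c[j] * xi_{n-j} with real
    coefficients c and an orthonormal sequence xi (illustrative helper)."""
    c = np.asarray(c, dtype=float)

    def cov(k):
        # R(k) = sum_j c[j + k] * c[j]; it vanishes beyond the moving-average order.
        k = abs(k)
        return float(np.dot(c[k:], c[:len(c) - k])) if k < len(c) else 0.0

    idx = np.arange(n_past)
    gram = np.array([[cov(i - j) for j in idx] for i in idx])  # Toeplitz Gram matrix of the past
    cross = np.array([cov(k) for k in range(1, n_past + 1)])   # E{x_n x_{n-k}}, k = 1..n_past
    weights = np.linalg.solve(gram, cross)                     # normal equations
    return cov(0) - cross @ weights

# Assumed example: 0.5 + z has its zero at z = -0.5, inside the unit circle.
err = finite_past_error([0.5, 1.0], n_past=30)
print(err)  # close to 1.0, well above |c_0|^2 = 0.25
```

The pair $(1, 0.5)$ gives the same covariances and hence the same error, and for it $|c_0|^2 = 1$ is attained in the limit; this is the replacement of $c_j$ by $c_j''$ carried out in the proof.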

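We now show that any regular process can be decomposed into a
regular and a non-regular process of canonical types. Let $\cdots, x_0, x_1, \cdots$
be the variables of a regular process, and define $\xi_n$ by

(4.4)    $x_n - \varphi_{n-1,1} = \sigma_1 \xi_n, \quad \sigma_1^2 = E\{|x_n - \varphi_{n-1,1}|^2\} > 0.$

Since $x_n - \varphi_{n-1,1}$ is orthogonal to $\mathfrak{M}_{n-1}$, it is orthogonal to every
$x_m - \varphi_{m-1,1}$ with $m < n$. Thus the $\xi_n$'s form an orthonormal set, and
we can write $x_n$ as the sum of a Fourier series in the $\xi_j$'s and a remainder $v_n$,

(4.5)    $x_n = u_n + v_n, \quad u_n = \sum_{j=0}^{\infty} c_j \xi_{n-j}, \quad \sum_{0}^{\infty} |c_j|^2 < \infty,$
         $c_j = E\{x_n \bar{\xi}_{n-j}\} \quad (c_0 = \sigma_1).$

Thus $u_n$ is the projection of $x_n$ on the closed linear manifold generated
by the $\xi_j$'s and $v_n$ is orthogonal to this manifold. From the definition
of $\xi_n$, $\xi_n \in \mathfrak{M}_n$; hence $u_n$ and $v_n = x_n - u_n$ are in $\mathfrak{M}_n$. To recapitulate,

$E\{\xi_m \bar{\xi}_n\} = \delta_{mn},$

$E\{\xi_m \bar{v}_n\} = E\{u_m \bar{v}_n\} = 0, \quad \text{all } m, n,$

$E\{\xi_n \bar{x}_m\} = 0, \quad m < n,$

$\xi_n, u_n, v_n \in \mathfrak{M}_n.$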




The geometrical significance of (4.5) can be seen as follows. The
orthogonality and inclusion relations just noted imply that $\xi_n$ is orthogonal
to $\mathfrak{M}_{n-1}$. Then, if $w \in \mathfrak{M}_{-\infty} = \bigcap_n \mathfrak{M}_n$, $w$ is orthogonal to every $\xi_n$.
Conversely, suppose that, for some $n$, $w \in \mathfrak{M}_n$ and that $w$ is orthogonal to
every $\xi_m$. Then $w$ is in the closed linear manifold generated by $\xi_n$ and
the random variables in $\mathfrak{M}_{n-1}$ (this is one way of describing $\mathfrak{M}_n$), and,
since $\xi_n$ is orthogonal to both $w$ and $\mathfrak{M}_{n-1}$, it follows that $w \in \mathfrak{M}_{n-1}$.
This argument is, in symbols,

$w = \hat{E}\{w \mid \mathfrak{M}_n\} = \hat{E}\{w \mid \mathfrak{M}_{n-1}, \xi_n\} = \hat{E}\{w \mid \mathfrak{M}_{n-1}\} + \hat{E}\{w \mid \xi_n\}$
$\quad = \hat{E}\{w \mid \mathfrak{M}_{n-1}\} \in \mathfrak{M}_{n-1}.$

Repeating this argument, it follows that $w \in \mathfrak{M}_{n-j}$ for all $j$, that is, $w \in \mathfrak{M}_{-\infty}$.
Thus, for each $n$, $\mathfrak{M}_{-\infty}$ consists of all the random variables in $\mathfrak{M}_n$
orthogonal to $\xi_n, \xi_{n-1}, \cdots$; that is to say, the manifold $\mathfrak{M}_{-\infty}$, together with
the one-dimensional manifolds of the multiples of $\xi_n, \xi_{n-1}, \cdots$ are
mutually orthogonal and generate $\mathfrak{M}_n$. Equation (4.5) expresses $x_n$ as
the sum of its projections on these orthogonal manifolds. The term $v_n$,
the projection on $\mathfrak{M}_{-\infty}$, is the contribution to $x_n$ from the remote past.
Let $\mathfrak{M}_{v,n} \subset \mathfrak{M}_{-\infty}$ be the closed linear manifold generated by $v_n, v_{n-1}, \cdots$.
We prove that $\mathfrak{M}_{v,n} = \mathfrak{M}_{-\infty}$ for all $n$ by proving that, if $w \in \mathfrak{M}_{-\infty}$,
$w$ is its own projection on $\mathfrak{M}_{v,n}$. In fact, since the $\xi_j$'s are orthogonal
to the $v_j$'s,

(4.6)    $\hat{E}\{w \mid v_n, v_{n-1}, \cdots\} = \hat{E}\{w \mid v_n, v_{n-1}, \cdots, \xi_n, \xi_{n-1}, \cdots\}$

with probability 1. According to (4.5) the closed linear manifold generated
by the $v_j$'s and $\xi_j$'s for $j \leq n$ is $\mathfrak{M}_n$. Hence the right side of (4.6)
is $\hat{E}\{w \mid \mathfrak{M}_n\}$, and this is $w$ because $w \in \mathfrak{M}_{-\infty} \subset \mathfrak{M}_n$.

A similar argument shows that $\xi_n$ lies in the closed linear manifold
generated by $u_n, u_{n-1}, \cdots$. We give the argument in geometrical terms
this time. The random variable $\xi_n$ lies in $\mathfrak{M}_n$, which is the closed linear
manifold generated by $u_n, u_{n-1}, \cdots, v_n, v_{n-1}, \cdots$, by (4.5). Since the
$v_j$'s are orthogonal to $\xi_n$ and to $u_n, u_{n-1}, \cdots$, they are superfluous here.
Thus the closed linear manifold generated by $u_n, u_{n-1}, \cdots$ is that generated
by $\xi_n, \xi_{n-1}, \cdots$.

The $u_n$ and $v_n$ processes are stationary in the wide sense. Since
$\mathfrak{M}_{v,n} = \mathfrak{M}_{v,n-1}$ $(= \mathfrak{M}_{-\infty})$, $v_n$ is in the closed linear manifold generated by
$v_{n-1}, v_{n-2}, \cdots$; in other words, the $v_n$ process is non-regular. Let $F_1$
and $F_2$ be the spectral distribution functions of the $u_n$ and $v_n$ processes
respectively. Since $x_n = u_n + v_n$ and since the $u_j$'s are orthogonal to
the $v_j$'s,

$F = F_1 + F_2.$




This orthogonality also means that any approximating sum $\sum_1^N b_j x_{n-j}$ used
in prediction theory splits into orthogonal sums,

$\sum_{j=1}^{N} b_j x_{n-j} = \sum_{j=1}^{N} b_j u_{n-j} + \sum_{j=1}^{N} b_j v_{n-j},$

one sum involving only the $u_n$ process, the other only the $v_n$ process. We
shall show that a single function $\Phi_\nu$ will serve as the prediction function
of the $x_n$, the $u_n$, and the $v_n$ process, so that a linear combination as above
which is very nearly the best approximation to $x_n$ splits into the sum of
the same linear combination of $u_j$'s giving nearly the best approximation
to $u_n$ and the same linear combination of $v_j$'s giving nearly the best
approximation to $v_n$.

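The prediction $\varphi_{n-\nu,\nu}$ can now be written down explicitly,

(4.7)    $\varphi_{n-\nu,\nu} = \hat{E}\{x_n \mid \mathfrak{M}_{n-\nu}\} = \sum_{j=\nu}^{\infty} c_j \xi_{n-j} + v_n.$

In fact, the right-hand side is a random variable in $\mathfrak{M}_{n-\nu}$, because

$\xi_{n-j} \in \mathfrak{M}_{n-j} \subset \mathfrak{M}_{n-\nu}, \quad j \geq \nu,$
$v_n \in \mathfrak{M}_{-\infty} \subset \mathfrak{M}_{n-\nu},$

and, using this definition of $\varphi_{n-\nu,\nu}$,

$x_n - \varphi_{n-\nu,\nu} = \sum_{j=0}^{\nu-1} c_j \xi_{n-j},$

which is orthogonal to $\mathfrak{M}_{n-\nu}$. The same reasoning shows that the infinite
sum on the right in (4.7) is the linear least squares prediction of $u_n$ in
terms of $u_{n-\nu}, u_{n-\nu-1}, \cdots$. The mean square error in predicting $u_n$ is
the same as that in predicting $x_n$,

(4.8)    $\sigma_\nu^2 = E\left\{\left|\sum_{j=0}^{\nu-1} c_j \xi_{n-j}\right|^2\right\} = |c_0|^2 + \cdots + |c_{\nu-1}|^2.$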
j=0 


If $\Phi_\nu$ is the prediction function of the $x_n$ process, for lag $\nu$,

(4.9)    $\sigma_\nu^2 = \int_{-1/2}^{1/2} |e^{2\pi i\nu\lambda} - \Phi_\nu(\lambda)|^2 \, dF(\lambda)$
         $\quad = \int_{-1/2}^{1/2} |e^{2\pi i\nu\lambda} - \Phi_\nu(\lambda)|^2 \, dF_1(\lambda) + \int_{-1/2}^{1/2} |e^{2\pi i\nu\lambda} - \Phi_\nu(\lambda)|^2 \, dF_2(\lambda).$

Now, since $\Phi_\nu$ is by definition in the closed linear manifold ${}_0\mathfrak{M}$ generated
by the functions $\{e^{2\pi i j\lambda}, j \leq 0\}$ with $\lambda$ weighting $dF(\lambda)$, $\Phi_\nu$ must be in the
closed linear manifolds generated by the same functions but using the
smaller weighting $dF_1(\lambda)$ and $dF_2(\lambda)$. Hence the two integrals in the
second line of (4.9) are at least equal respectively to the mean square
prediction errors for lag $\nu$ of the $u_n$ and $v_n$ processes, namely $\sigma_\nu^2$ and 0.
It follows that the first integral is $\sigma_\nu^2$ and the second one 0. In other
words, $\Phi_\nu$ is simultaneously the prediction function for lag $\nu$ of the $x_n$,
the $u_n$ and the $v_n$ processes.
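The $\xi_n$ process is stationary in the wide sense with spectral density
identically 1. If we write $\xi_n$ in its spectral representation,

$\xi_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \, dy_1(\lambda), \quad E\{|dy_1(\lambda)|^2\} = d\lambda,$

where the $y_1(\lambda)$ process is a process with orthogonal increments, we find

(4.10)    $u_n = \sum_{j=0}^{\infty} c_j \xi_{n-j} = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \left(\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}\right) dy_1(\lambda).$

If we write $v_n$ in its spectral representation,

$v_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \, dy_2(\lambda), \quad E\{|dy_2(\lambda)|^2\} = dF_2(\lambda),$

the $y_2$ increments are orthogonal to the $y_1$ increments. The spectral
representation of $x_n$ can now be put in the form

$x_n = u_n + v_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \left[\left(\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda}\right) dy_1(\lambda) + dy_2(\lambda)\right],$

so that the $y(\lambda)$ of the spectral representation (2.1) of $x_n$ is given by

$y(\lambda) = \int_{-1/2}^{\lambda} \left[\left(\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda'}\right) dy_1(\lambda') + dy_2(\lambda')\right].$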
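We now can find the $\lambda$ functions corresponding to $\xi_n$, $u_n$, $v_n$, and $\varphi_{n-\nu,\nu}$
under the correspondence (2.2), and incidentally evaluate $F_2$. If $\xi_n$
corresponds to $\Phi$,

$\xi_n = \int_{-1/2}^{1/2} \Phi(\lambda) \left[\left(\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda}\right) dy_1(\lambda) + dy_2(\lambda)\right] = \int_{-1/2}^{1/2} e^{2\pi i n\lambda} \, dy_1(\lambda),$

we find, on comparing integrands, that

$\Phi(\lambda) \sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda} = e^{2\pi i n\lambda}$

for almost all $\lambda$ (Lebesgue measure) and

$\Phi(\lambda) = 0$

for almost all $\lambda$ ($F_2$ measure). These two conditions are incompatible
unless there is a $\lambda$ set $S$ of Lebesgue measure 0 such that

$\int_S dF_2(\lambda) = F_2(\tfrac12) - F_2(-\tfrac12),$

that is, unless $F_2$ is a singular monotone function. In the following we
shall call $S$ the set of increase of $F_2$. Since the expression for $u_n$ shows
that the $u_n$ process has an absolutely continuous spectral distribution with
density (see X §8), given by

$F_1'(\lambda) = \left|\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda}\right|^2,$

we have thus proved that the decomposition of $x_n$ into a regular $u_n$
process and a deterministic $v_n$ process corresponds in terms of spectral
distributions to the representation of $F$ as the sum of its absolutely
continuous and singular components $F_1$ and $F_2$ respectively. The function
$\Phi$ corresponding to $\xi_n$ is uniquely determined, neglecting values on a set
of $F$ measure 0, by the above equations. We can take

$\Phi(\lambda) = \dfrac{e^{2\pi i n\lambda}}{\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda}}, \quad \lambda \notin S,$
$\qquad\; = 0, \quad \lambda \in S.$

It follows that we can take as the $\lambda$ functions corresponding to $u_n$, $v_n$,
$\varphi_{n-\nu,\nu}$,

$u_n$:  $e^{2\pi i n\lambda}$ for $\lambda \notin S$;  $0$ for $\lambda \in S$,

$v_n$:  $0$ for $\lambda \notin S$;  $e^{2\pi i n\lambda}$ for $\lambda \in S$,

$\varphi_{n-\nu,\nu}$:  $e^{2\pi i n\lambda} \dfrac{\sum_{j=\nu}^{\infty} c_j e^{-2\pi i j\lambda}}{\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}}$ for $\lambda \notin S$;  $e^{2\pi i n\lambda}$ for $\lambda \in S$,

and, since $\Phi_\nu$, the prediction function for lag $\nu$, corresponds to $\varphi_{0,\nu}$,

(4.11)    $\Phi_\nu(\lambda) = e^{2\pi i\nu\lambda} \dfrac{\sum_{j=\nu}^{\infty} c_j e^{-2\pi i j\lambda}}{\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}}, \quad \lambda \notin S,$
          $\qquad\quad\; = e^{2\pi i\nu\lambda}, \quad \lambda \in S.$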




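Thus $u_n$ and $v_n$ can be expressed in the form

$u_n = \int_{S'} e^{2\pi i n\lambda} \, dy(\lambda), \quad v_n = \int_{S} e^{2\pi i n\lambda} \, dy(\lambda),$

where $S'$ is the complement of $S$ in $[-\tfrac12, \tfrac12]$. In X §5 we have discussed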
the decompositions of a process into processes with disjunct spectra and 
shown how they can be accomplished by linear operations. We have 
now shown that in the regular case the decomposition separating out the 
singular component of the spectral distribution can be accomplished by 
a linear operation on the past and corresponds to separating out the 
deterministic component of the process. 
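
A simple instance of this separation is a process whose spectral distribution is Lebesgue measure plus a single jump. The Python sketch below is an illustration that is not in the original text: the function names, the jump size `a2`, and the jump location `lam0` are assumptions. It takes $x_n = \xi_n + A e^{2\pi i n\lambda_0}$ with $E\{|A|^2\} = a^2$ and $A$ orthogonal to the $\xi_n$'s, so that $R(k) = \delta_{k0} + a^2 e^{2\pi i k\lambda_0}$, and solves the normal equations for prediction from $N$ past values. A direct calculation for this rank-one covariance gives the error $1 + a^2/(1 + a^2 N)$, which decreases to 1, the lag 1 error of the regular component alone.

```python
import numpy as np

def cov(k, a2, lam0):
    """R(k) = E{x_{n+k} conj(x_n)} for white noise plus one harmonic at lam0
    (assumed illustrative model)."""
    k = np.asarray(k)
    return (k == 0).astype(float) + a2 * np.exp(2j * np.pi * lam0 * k)

def finite_past_error(a2, lam0, n_past):
    """Mean square error of the best linear predictor of x_n from
    x_{n-1}, ..., x_{n-n_past}."""
    j = np.arange(1, n_past + 1)
    gram = cov(j[None, :] - j[:, None], a2, lam0)   # gram[j, k] = R(k - j)
    cross = cov(j, a2, lam0)                        # R(k), k = 1..n_past
    b = np.linalg.solve(gram.T, cross)              # normal equations for the weights b_j
    return float((cov(0, a2, lam0) - b @ cov(-j, a2, lam0)).real)

for n_past in (1, 10, 100):
    print(n_past, finite_past_error(a2=4.0, lam0=0.1, n_past=n_past))
```

The harmonic term plays the part of $v_n$: it contributes nothing to the limiting error however large its variance $a^2$ is.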

We observe that $\varphi_{n-\nu,\nu}$, the prediction of $x_n$ in terms of $x_{n-\nu}, x_{n-\nu-1}, \cdots$,
becomes $v_n$ when $\nu \to \infty$. Only the deterministic component can be
predicted from the remote past. As would be expected,

$\lim_{\nu\to\infty} \sigma_\nu^2 = E\{|u_n|^2\}.$

In the decomposition of a regular process the regular component must
be present, but of course the deterministic component, the $v_n$ process,
may not appear at all, that is, we may have $x_n = u_n$.

The decomposition can be summarized as follows: 

THEOREM 4.2 Let $\{x_n, -\infty < n < \infty\}$ be a regular process. Then $x_n$
can be written in the form

$x_n = \sum_{j=0}^{\infty} c_j \xi_{n-j} + v_n = u_n + v_n,$

where

$\sum_{0}^{\infty} |c_j|^2 < \infty, \quad c_0 > 0,$

$E\{\xi_m \bar{\xi}_n\} = \delta_{m,n}, \quad E\{\xi_m \bar{v}_n\} = 0 \quad \text{all } m, n,$
$\xi_n \in \mathfrak{M}_n, \quad v_n \in \mathfrak{M}_{-\infty}.$

There is only one sequence of constants $\{c_j\}$ and only one sequence of random
variables $\{\xi_n\}$ satisfying these conditions.

In this representation the $u_n$ process is regular and the $v_n$ process is
deterministic. The prediction $\varphi_{n-\nu,\nu}$ is given by (4.7), and the prediction
error is $\sigma_\nu^2 = \sum_0^{\nu-1} |c_j|^2$, which is the same as that for the $u_n$ process. The
spectral distributions of the $u_n$ and $v_n$ processes are respectively the
absolutely continuous and singular components of the spectral distribution of
the $x_n$ process.
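
The error formula of the theorem is a running sum of the $|c_j|^2$, which the following Python sketch makes concrete. It is an illustration only: the helper name `prediction_errors` and the choice $c_j = \rho^j$ with $\rho = 0.8$ (the coefficients of a first-order autoregression $x_n = \rho x_{n-1} + \xi_n$, assumed here as an example) are not taken from the text. For this choice $\sum c_j z^j = 1/(1 - \rho z)$ has no zeros in $|z| < 1$, and the errors should increase from 1 to $E\{|u_n|^2\} = 1/(1 - \rho^2)$.

```python
import numpy as np

def prediction_errors(c, max_lag):
    """sigma_nu^2 = |c_0|^2 + ... + |c_{nu-1}|^2 for nu = 1, ..., max_lag."""
    c = np.asarray(c)
    return np.cumsum(np.abs(c[:max_lag]) ** 2)

rho = 0.8                                # assumed autoregression parameter
c = rho ** np.arange(200)                # truncated coefficient sequence c_j = rho^j
sigma2 = prediction_errors(c, 200)
print(sigma2[:3])                        # errors for lags 1, 2, 3
print(sigma2[-1], 1.0 / (1.0 - rho**2))  # large-lag error against E{|u_n|^2}
```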




All that remains to be proved is the uniqueness of the $c_j$'s and $\xi_n$'s under
the indicated conditions. These conditions imply that $c_j = E\{x_n \bar{\xi}_{n-j}\}$.
Moreover, $\varphi_{n-1,1}$ must be given by

$\varphi_{n-1,1} = \sum_{j=1}^{\infty} c_j \xi_{n-j} + v_n,$

since as so defined $\varphi_{n-1,1} \in \mathfrak{M}_{n-1}$ and $x_n - \varphi_{n-1,1}$ is orthogonal to $\mathfrak{M}_{n-1}$.
Then $x_n - \varphi_{n-1,1} = c_0 \xi_n$ is uniquely determined by the process. Since
the process is regular, $|c_0|^2 = \sigma_1^2 > 0$; $c_0$ is then uniquely determined by
the hypothesis that it is positive, and $\xi_n$ is thereby also uniquely
determined, that is, neglecting values on a set of probability 0. Finally $c_j$ for
$j > 0$ is now uniquely determined by the Fourier coefficient formula
already written.

The theorems of IV §6 make it possible to put the preceding results in 
an analytic form. 

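THEOREM 4.3 A process is regular if and only if $F'(\lambda) > 0$ for almost all
$\lambda$ (Lebesgue measure) and

(4.12)    $\int_{-1/2}^{1/2} \log F'(\lambda) \, d\lambda > -\infty.$

In the regular case the constants $\{c_n\}$ of Theorem 4.2 satisfy and are uniquely
determined by the conditions

(4.13)    $c_0 = \exp\left\{\tfrac12 \int_{-1/2}^{1/2} \log F'(\lambda) \, d\lambda\right\},$
          $\sum_{0}^{\infty} |c_n|^2 < \infty, \quad \sum_{0}^{\infty} c_n z^n \neq 0 \quad (|z| < 1),$
          $F'(\lambda) = \left|\sum_{0}^{\infty} c_n e^{-2\pi i n\lambda}\right|^2,$

the last to hold for almost all $\lambda$ (Lebesgue measure). These constants can
be calculated by using

(4.14)    $\tfrac12 \log F'(\lambda) \sim \sum_{-\infty}^{\infty} a_n e^{-2\pi i n\lambda} \quad \left(\text{that is, } a_n = \tfrac12 \int_{-1/2}^{1/2} \log F'(\lambda) e^{2\pi i n\lambda} \, d\lambda\right),$
          $\sum_{0}^{\infty} c_n z^n = \exp\left\{a_0 + 2\sum_{1}^{\infty} a_n z^n\right\}, \quad |z| < 1.$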
0 


Note that $\sigma_1^2 = c_0^2$ has now been evaluated explicitly. This theorem
now requires little proof. We have already seen that in the regular case
$F_1$ is the absolutely continuous component of $F$. Then

(4.15)    $F'(\lambda) = F_1'(\lambda) = \left|\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda}\right|^2$

for almost all $\lambda$ (Lebesgue measure). Hence by IV, Theorem 6.2,
$F'(\lambda) > 0$ for almost all $\lambda$ and (4.12) is true. Conversely, if $F'(\lambda) > 0$ for
almost all $\lambda$ and if (4.12) is true, $F'$ can be written as the square of the
modulus of a series of the type in (4.13) according to IV, Theorem 6.2,
and the regularity follows from Theorem 4.1. Thus (4.12) is necessary
and sufficient for regularity.

According to Theorem 4.1 the constants $\{c_n\}$ of Theorem 4.2 must have
the property that $\sum_0^\infty c_n z^n \neq 0$ for $|z| < 1$, since the mean square error of
the $x_n$ and $u_n$ processes satisfies (4.2) with equality. To derive the
evaluation of $c_0$ in (4.13) we note that if (4.12) is satisfied we can, according to
IV, Theorem 6.2, write $F'$ in the form

$F'(\lambda) = \left|\sum_{0}^{\infty} d_n e^{-2\pi i n\lambda}\right|^2, \quad \sum_{0}^{\infty} |d_n|^2 < \infty,$

with

$d_0 = \exp\left\{\tfrac12 \int_{-1/2}^{1/2} \log F'(\lambda) \, d\lambda\right\}.$

But then, according to Theorem 4.1, the prediction error $\sigma_1^2 = c_0^2$ for
lag 1 of the $u_n$ process, a process with spectral density $F'$, is at least $d_0^2$.
Hence $c_0 \geq d_0$. On the other hand, according to IV, Theorem 6.2 [see
IV, (6.7')], the reverse inequality is necessarily true, so that $c_0 = d_0$, as
was to be proved. Thus the conditions (4.13) are satisfied. Conversely,
according to IV, Theorem 6.2, the conditions (4.13) uniquely determine
the $c_n$'s and they can be calculated explicitly by (4.14).
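
The calculation prescribed by (4.13) and (4.14) can be tried numerically. The Python sketch below is an illustration and not part of the original text: the function name `canonical_coefficients`, the grid size, and the test density are assumptions. It uses the sign convention $F'(\lambda) = |\sum c_n e^{-2\pi i n\lambda}|^2$ with $a_n = \tfrac12 \int \log F'(\lambda) e^{2\pi i n\lambda} \, d\lambda$, approximates the $a_n$ by a Riemann sum over one period, and expands $\exp\{a_0 + 2\sum_1^\infty a_n z^n\}$ by the usual recursion $m c_m = \sum_{k=1}^{m} k b_k c_{m-k}$ for the coefficients of the exponential of a power series $\sum b_k z^k$.

```python
import numpy as np

def canonical_coefficients(density, n_coeffs, n_grid=4096):
    """Approximate c_0, ..., c_{n_coeffs-1} with sum c_n z^n = exp{a_0 + 2 sum a_n z^n},
    where a_n = (1/2) * integral of log F'(lam) * exp(2 pi i n lam) over [-1/2, 1/2]."""
    lam = np.arange(n_grid) / n_grid - 0.5           # uniform grid on [-1/2, 1/2)
    half_log = 0.5 * np.log(density(lam))
    a = np.array([np.mean(half_log * np.exp(2j * np.pi * n * lam))
                  for n in range(n_coeffs)])         # Riemann sums for a_0, a_1, ...
    b = 2.0 * a                                      # exponent coefficients b_k = 2 a_k, k >= 1
    b[0] = a[0]                                      # constant term is a_0 itself
    c = np.zeros(n_coeffs, dtype=complex)
    c[0] = np.exp(b[0])
    for m in range(1, n_coeffs):                     # m c_m = sum_{k=1}^{m} k b_k c_{m-k}
        c[m] = sum(k * b[k] * c[m - k] for k in range(1, m + 1)) / m
    return c

# Assumed test density: |0.5 + e^{-2 pi i lam}|^2, i.e. x_n = 0.5 xi_n + xi_{n-1}.
density = lambda lam: np.abs(0.5 + np.exp(-2j * np.pi * lam)) ** 2
c = canonical_coefficients(density, 4)
print(np.round(c.real, 6))  # approximately [1, 0.5, 0, 0]
```

The density was generated from the coefficients $(0.5, 1)$, whose generating function vanishes at $z = -0.5$; the computation returns instead the pair $(1, 0.5)$, for which $\sum c_n z^n \neq 0$ in $|z| < 1$ and $c_0^2 = 1$ is the lag 1 prediction error.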

Note that, if $x_n$ can be written in the form

(4.16)    $x_n = \sum_{j=0}^{\infty} c_j \xi_{n-j}, \quad \sum_{0}^{\infty} |c_j|^2 < \infty, \quad E\{\xi_m \bar{\xi}_n\} = \delta_{mn},$

with not all $c_j$'s vanishing, it is no restriction, shifting subscripts on the
$\xi_j$'s if necessary, to suppose that $c_0 \neq 0$. The process is regular by
Theorem 4.1. Since $F$ has no singular component in this case, the
process has no deterministic component. The $c_j$'s in (4.16) may not be
the uniquely determined $c_j$'s of Theorems 4.2 and 4.3, however, even if
the $\xi_j$'s are chosen, as can always be done and as we shall suppose has
been done, so that $c_0$ is real and positive. In fact, if (4.13) is not satisfied
there will be a uniquely determined different sequence $\{c_n'\}$ for which it
is, and there will be a different orthonormal sequence $\{\xi_n'\}$ such that

(4.16')    $x_n = \sum_{j=0}^{\infty} c_j' \xi_{n-j}', \quad c_0' > c_0.$

If $x_n$ can be written in the form

(4.17)    $x_n = \sum_{j=-\infty}^{\infty} c_j \xi_{n-j}, \quad \sum_{-\infty}^{\infty} |c_j|^2 < \infty, \quad E\{\xi_m \bar{\xi}_n\} = \delta_{mn},$

the process has an absolutely continuous spectral distribution function $F$,
with

(4.18)    $F'(\lambda) = \left|\sum_{-\infty}^{\infty} c_j e^{-2\pi i j\lambda}\right|^2.$

Since any integrable non-negative function can be put in the form (4.18)
(the sum is simply the Fourier series of any square root of $F'$), the only
restriction implicit in (4.17) is that $F$ be absolutely continuous. The $x_n$
process may be regular, and then will have no deterministic component,
or it may be non-regular.

At the other extreme, if the $x_n$ process has a distribution function which
is singular, then the process cannot be regular, because $F'(\lambda) = 0$ except
on a $\lambda$ set of Lebesgue measure 0. It is interesting to observe that,
although the deterministic component of a regular process is always of
this type, this is not the general non-regular case. In fact, by Theorem
4.3 the spectral distribution of a non-regular process may be absolutely
continuous, as long as (4.12) is false.

It will be useful to have an explicit representation of the linear manifolds
$\mathfrak{M}_n$ and $\mathfrak{M}_{-\infty}$ in $\lambda$ language. The manifold $\mathfrak{M}_n$ has corresponding $\lambda$
manifold ${}_n\mathfrak{M}$, the closed linear manifold generated by $\{e^{2\pi i m\lambda}, m \leq n\}$
[$\lambda$ weighting $dF(\lambda)$] and ${}_{-\infty}\mathfrak{M} = \bigcap_n {}_n\mathfrak{M}$ corresponds to $\mathfrak{M}_{-\infty}$. We observe
that the definitions of ${}_n\mathfrak{M}$ and ${}_{-\infty}\mathfrak{M}$ depend on the choice of $F$ but do not
involve any probability concepts, and we shall phrase their description
without involving these concepts.

THEOREM 4.4 Let $F$ be a monotone non-decreasing function for $|\lambda| \leq \tfrac12$,
with $F(\lambda+) = F(\lambda)$, $F(-\tfrac12) = 0$. Let $F_2$ be its singular component, and
let $S$, of Lebesgue measure 0, be the set of increase of $F_2$.

(i) If $F'(\lambda) = 0$ at most on a set of Lebesgue measure 0, and if

(4.19)    $\int_{-1/2}^{1/2} \log F'(\lambda) \, d\lambda > -\infty,$

there is one and only one sequence $\{c_j\}$ satisfying (4.13). The $c_j$'s are
determined, for example, by (4.14). The manifold ${}_{-\infty}\mathfrak{M}$ consists of all
functions $\alpha$ which are measurable with respect to $F$, vanish for almost all
$\lambda$ (Lebesgue measure), and for which

$\int_{-1/2}^{1/2} |\alpha(\lambda)|^2 \, dF(\lambda) < \infty.$

The manifold ${}_n\mathfrak{M}$ consists of all functions of the form

$e^{2\pi i n\lambda} \dfrac{\sum_{0}^{\infty} \gamma_j e^{-2\pi i j\lambda}}{\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda}}, \quad \lambda \notin S,$
$\alpha(\lambda), \quad \lambda \in S,$

where $\alpha$ is as just described and $\sum_0^\infty |\gamma_j|^2 < \infty$.

(ii) If the hypothesis of (i) is not satisfied, ${}_n\mathfrak{M} = {}_{-\infty}\mathfrak{M}$ for all $n$ and this
manifold consists of all functions $\beta$ which are measurable with respect to $F$
and for which

$\int_{-1/2}^{1/2} |\beta(\lambda)|^2 \, dF(\lambda) < \infty.$

(i') If there are constants $\{c_j\}$ with $\sum_0^\infty |c_j|^2 < \infty$ such that (4.18) is true,
then, if ${}_n\mathfrak{M}$ is as described in (i), the $c_j$'s are the uniquely determined constants
described in (i), aside from a multiplicative constant of modulus 1.

(ii') If ${}_n\mathfrak{M}$ is as described in (ii), (4.19) is false.

The theorem is obviously true [with Case (ii)] if $F(\tfrac12) = 0$, and in the
following it will be convenient to exclude this possibility. If $\lambda$ is
considered a random variable, with distribution function $F/F(\tfrac12)$, the $x_n$
process defined by

$x_n = F(\tfrac12)^{1/2} e^{2\pi i n\lambda}$

is stationary in the wide sense with spectral distribution function $F$. The
theorem thus becomes a theorem on stationary (wide sense) processes,
and the previous theorems can be used to deduce it. Only the description
of the manifolds ${}_n\mathfrak{M}$ and ${}_{-\infty}\mathfrak{M}$ requires any comment. In the regular case
we have seen that $\mathfrak{M}_{-\infty}$ is the closed linear manifold generated by the
$v_n$'s, that is, ${}_{-\infty}\mathfrak{M}$ is the closed linear manifold [$\lambda$ weighting $dF(\lambda)$]
generated by the functions

$0, \quad \lambda \notin S,$
$e^{2\pi i n\lambda}, \quad \lambda \in S, \qquad n = 0, \pm 1, \cdots,$

and this is the manifold described in (i). The manifold $\mathfrak{M}_n$ is the closed
linear manifold generated by $\mathfrak{M}_{-\infty}$ and the $\xi_j$'s for $j \leq n$, that is, ${}_n\mathfrak{M}$ is
the closed linear manifold [$\lambda$ weighting $dF(\lambda)$] generated by ${}_{-\infty}\mathfrak{M}$ and the
functions

$\dfrac{e^{2\pi i k\lambda}}{\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda}}, \quad \lambda \notin S, \quad k \leq n,$
$0, \quad \lambda \in S.$

Then considering the functions only on the complement of $S$ [where $dF(\lambda)$
weighting means $\left|\sum_0^\infty c_j e^{-2\pi i j\lambda}\right|^2 d\lambda$ weighting], ${}_n\mathfrak{M}$ consists of all functions
of the form

$\dfrac{\gamma(\lambda)}{\sum_{0}^{\infty} c_j e^{-2\pi i j\lambda}},$

where $\gamma$ is a function in the closed linear manifold [$\lambda$ weighting $d\lambda$]
generated by $\{e^{2\pi i j\lambda}, j \leq n\}$, that is, $\gamma$ consists of all functions

$e^{2\pi i n\lambda} \sum_{0}^{\infty} \gamma_j e^{-2\pi i j\lambda}, \quad \sum_{0}^{\infty} |\gamma_j|^2 < \infty.$

In the non-regular case ${}_n\mathfrak{M} = {}_{-\infty}\mathfrak{M}$ consists of all the functions in the
closed linear manifold generated by $\{e^{2\pi i j\lambda}, j = 0, \pm 1, \cdots\}$, and this is
the manifold described in (ii).


5. General solution of the prediction problem (continuous parameter)


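Let $\{x(t), -\infty < t < \infty\}$ be a continuous parameter stationary (wide
sense) process. We suppose throughout that the continuity condition

$\lim_{t \to 0} E\{|x(t) - x(0)|^2\} = 0$

is satisfied. Let $G$ be the spectral distribution function of the process.
As in the discrete parameter case, we introduce a systematic notation for
certain closed linear manifolds of random variables and $\mu$ functions
[using $\mu$ weighting $dG(\mu)$ in the latter case].

$\mathfrak{N}(r, s) = \mathfrak{M}\{x(t), r \leq t \leq s\}$          $(r, s)\mathfrak{N} = \mathfrak{M}\{e^{2\pi i t\mu}, r \leq t \leq s\}$
$\mathfrak{N}_t = \mathfrak{N}(-\infty, t)$                            ${}_t\mathfrak{N} = (-\infty, t)\mathfrak{N}$
$\mathfrak{N}_{-\infty} = \bigcap_t \mathfrak{N}_t$                    ${}_{-\infty}\mathfrak{N} = \bigcap_t {}_t\mathfrak{N}$
$\mathfrak{N}_\infty = \mathfrak{M}\{x(t), -\infty < t < \infty\}$     ${}_\infty\mathfrak{N} = \mathfrak{M}\{e^{2\pi i t\mu}, -\infty < t < \infty\}$

The functions in ${}_t\mathfrak{N}$ consist of those in ${}_0\mathfrak{N}$ multiplied by $e^{2\pi i t\mu}$. The
manifold ${}_\infty\mathfrak{N}$ consists of the functions $\Psi$ measurable with respect to $G$
for which $\int_{-\infty}^{\infty} |\Psi(\mu)|^2 \, dG(\mu) < \infty$. (See the corresponding descriptions
in §1.)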

The problem of predicting $x(t)$ on the basis of a finite segment of the
past or of the full past using linear least squares prediction is that of
minimizing

$E\{|x(t) - w|^2\}, \quad w \in \mathfrak{N}(r, s),$

where $r$, $s$ are fixed and $r < s < t$. The solution is $\hat{E}\{x(t) \mid \mathfrak{N}(r, s)\}$, and
as in the discrete parameter case

$\underset{r \to -\infty}{\mathrm{l.i.m.}} \; \hat{E}\{x(t) \mid \mathfrak{N}(r, s)\} = \hat{E}\{x(t) \mid \mathfrak{N}_s\},$

so that prediction on the basis of a finite segment of the past becomes that
based on the full past as the segment increases. Because of the continuity
condition imposed on the process, $\mathfrak{N}(r, s)$ is the closed linear manifold
generated by the random variables $x(t_1), x(t_2), \cdots$, where $\{t_j\}$ is any
sequence in the interval $(r, s)$, dense in that interval. Thus it is possible
to avoid explicit use of non-denumerably many random variables, but
there is no reason to do so.

The remarks made in §1 on non-linear prediction go over to the
continuous parameter case without modification.

We shall evaluate $\psi(s, t)$, the prediction of $x(s + t)$ with lag $t$ based on
the full past,

$\psi(s, t) = \hat{E}\{x(s + t) \mid \mathfrak{N}_s\}.$

This prediction problem will be solved by reducing it to discrete parameter
prediction. To avoid confusion we shall always use $F$ and $G$ respectively
to denote discrete and continuous parameter process spectral distribution
functions.

The spectral representation

(5.1)    $x(t) = \int_{-\infty}^{\infty} e^{2\pi i t\mu} \, dy(\mu), \quad E\{|dy(\mu)|^2\} = dG(\mu),$

induces a correspondence between random variables and $\mu$ functions (see
§2) in which $x(t)$ corresponds to $e^{2\pi i t\mu}$, and more generally, if $w$
corresponds to $\Psi(\mu)$,

(5.2)    $w = \int_{-\infty}^{\infty} \Psi(\mu) \, dy(\mu).$

In this correspondence $w \in \mathfrak{N}_\infty$ is uniquely determined by $\Psi \in {}_\infty\mathfrak{N}$,
neglecting zero probabilities, and $\Psi$ is uniquely determined by $w$, neglecting
$\mu$ sets of $G$ measure 0. In particular, $\mathfrak{N}_t$ corresponds to ${}_t\mathfrak{N}$ for
$-\infty \leq t \leq \infty$.

Instead of stating the prediction problem in the language of random
variables it can be stated in $\mu$ language: the prediction function $\Psi_t$ is to
be found, a function in ${}_0\mathfrak{N}$ which minimizes

$\int_{-\infty}^{\infty} |e^{2\pi i t\mu} - \Psi(\mu)|^2 \, dG(\mu)$

for functions $\Psi$ in ${}_0\mathfrak{N}$. We have

(5.3)    $\psi(s, t) = \int_{-\infty}^{\infty} e^{2\pi i s\mu} \Psi_t(\mu) \, dy(\mu).$




The function $\Psi_t$ is the projection of $e^{2\pi i t\mu}$ on ${}_0\mathfrak{N}$. If $\sigma_t^2$ is the mean
square prediction error,

(5.4)    $\sigma_t^2 = E\{|x(s + t) - \psi(s, t)|^2\} = \int_{-\infty}^{\infty} |e^{2\pi i t\mu} - \Psi_t(\mu)|^2 \, dG(\mu).$

As in the discrete parameter case (see §2), $\sigma_t^2$ is monotone
non-decreasing in $t$, and either $\sigma_t^2 = 0$ or $\sigma_t^2 > 0$ for all $t > 0$. The first case
will be called the non-regular or deterministic case, the second the regular
case. To each continuous parameter process with spectral distribution
function $G$ we make correspond a discrete parameter process with spectral
distribution function $F$, where

$F(\lambda) = G(\mu), \quad \lambda = \frac{1}{\pi} \arctan \mu.$

Here

$e^{2\pi i\lambda} = \frac{1 + i\mu}{1 - i\mu}, \quad \mu = \tan \pi\lambda.$

The manifolds ${}_\infty\mathfrak{M}$ and ${}_\infty\mathfrak{N}$ correspond to each other under this change
of variables. The critical closed linear manifold for prediction purposes
in the discrete parameter case is ${}_0\mathfrak{M}$, defined in §4 as the closed linear
manifold of $\lambda$ functions, $|\lambda| \leq \tfrac12$, generated by the sequence $\{e^{2\pi i n\lambda}, n \leq 0\}$,
$\lambda$ weighting $dF(\lambda)$. In the continuous parameter case the critical prediction
manifold is ${}_0\mathfrak{N}$. The manifold of $\lambda$ functions ${}_0\mathfrak{M}$ goes into a manifold of
$\mu$ functions $\tilde{\mathfrak{N}}$, the closed linear manifold [$\mu$ weighting $dG(\mu)$] generated
by the sequence

$\left\{\left(\frac{1 + i\mu}{1 - i\mu}\right)^n, \; n \leq 0\right\}.$


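We prove that $\tilde{\mathfrak{N}} = {}_0\mathfrak{N}$. In the first place the representation

$\frac{1 - i\mu}{1 + i\mu} = -1 + \frac{2}{1 + i\mu} = -1 + 4\pi \int_{-\infty}^{0} e^{2\pi t} e^{2\pi i t\mu} \, dt$

shows that $\left(\frac{1 + i\mu}{1 - i\mu}\right)^{-1}$, and therefore $\left(\frac{1 + i\mu}{1 - i\mu}\right)^{n}$ for all $n \leq 0$, can be
approximated boundedly and uniformly in every finite $\mu$ interval by
linear combinations of the $e^{2\pi i t\mu}$'s involving only values of $t \leq 0$. Then
$\tilde{\mathfrak{N}} \subset {}_0\mathfrak{N}$. In the second place, if $t < 0$ the function

$e^{2\pi t \frac{z - 1}{z + 1}}$

is regular for $|z| > 1$, with modulus $< 1$. Hence it has an expansion in
non-positive powers of $z = |z| e^{2\pi i\lambda}$, valid for $|z| > 1$. For each $t < 0$
and $|z| > 1$ this function is therefore a function of $\lambda$ in ${}_0\mathfrak{M}$ which becomes
a function of $\mu$ in $\tilde{\mathfrak{N}}$. When $|z| \to 1$ with $\lambda$ fixed, this function becomes

$e^{2\pi i t \tan \pi\lambda} = e^{2\pi i t\mu}$

if $\lambda \neq \pm\tfrac12$. Hence, if $t < 0$, $e^{2\pi i t\mu}$ can be approximated boundedly
by functions in $\tilde{\mathfrak{N}}$. Then $e^{2\pi i t\mu} \in \tilde{\mathfrak{N}}$ if $t \leq 0$, so that ${}_0\mathfrak{N} \subset \tilde{\mathfrak{N}}$, and,
combining this with the previously obtained reverse relation, we find
that $\tilde{\mathfrak{N}} = {}_0\mathfrak{N}$.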

Now a discrete parameter process is deterministic if and only if
${}_n\mathfrak{M} = {}_\infty\mathfrak{M}$ for all $n$, and it is sufficient if this holds for a single value of $n$,
say $n = 0$. Similarly, a continuous parameter process is deterministic if
and only if ${}_0\mathfrak{N} = {}_\infty\mathfrak{N}$. Since ${}_0\mathfrak{M}$ and ${}_\infty\mathfrak{M}$ go into ${}_0\mathfrak{N}$ and ${}_\infty\mathfrak{N}$, we have
obtained the result that a continuous parameter process is deterministic if
and only if its corresponding discrete parameter process is deterministic.

This leads at once to an analytic condition for regularity corresponding 
to (4.12) in the discrete parameter case. 

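THEOREM 5.1 A process is regular if and only if the derivative $G'$ of its
spectral distribution function vanishes at most on a set of Lebesgue measure
0 and

$\int_{-\infty}^{\infty} \frac{\log G'(\mu)}{1 + \mu^2} \, d\mu > -\infty.$

This theorem follows from Theorem 4.3, since

$\log F'(\lambda) \, d\lambda = \frac{\log [G'(\mu) \, \pi (1 + \mu^2)]}{\pi (1 + \mu^2)} \, d\mu,$

so that the integrals

$\int_{-1/2}^{1/2} \log F'(\lambda) \, d\lambda, \qquad \int_{-\infty}^{\infty} \frac{\log G'(\mu)}{1 + \mu^2} \, d\mu$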

always are finite and infinite together. 
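
The criterion can be explored numerically through the same change of variable $\mu = \tan \pi\lambda$, under which $\int \log G'(\mu) \, d\mu / (1 + \mu^2)$ becomes $\pi \int_{-1/2}^{1/2} \log G'(\tan \pi\lambda) \, d\lambda$. The Python sketch below is an illustration and not part of the original text: the function name `regularity_integral` and the two densities are assumptions chosen for the example. For $G'(\mu) = 1/(1 + \mu^2)$ the integral is finite and equals $-2\pi \log 2$; for $G'(\mu) = e^{-\mu^2}$ it diverges to $-\infty$, which shows up as midpoint sums that grow without bound as the grid is refined.

```python
import numpy as np

def regularity_integral(log_density, n_grid):
    """Midpoint-rule value of the integral of log G'(mu) / (1 + mu^2) over the
    real line, computed as pi * mean of log G'(tan(pi * lam)) on [-1/2, 1/2]."""
    lam = (np.arange(n_grid) + 0.5) / n_grid - 0.5   # midpoints avoid lam = -1/2 and 1/2
    mu = np.tan(np.pi * lam)
    return np.pi * np.mean(log_density(mu))

# Assumed example 1: G'(mu) = 1 / (1 + mu^2); the integral is finite.
finite_case = regularity_integral(lambda mu: -np.log1p(mu ** 2), 200_000)
print(finite_case, -2 * np.pi * np.log(2))

# Assumed example 2: G'(mu) = exp(-mu^2); the sums diverge as the grid is refined.
coarse = regularity_integral(lambda mu: -mu ** 2, 1_000)
fine = regularity_integral(lambda mu: -mu ** 2, 100_000)
print(coarse, fine)
```

A divergent sequence of sums is of course only suggestive; the conclusion for $e^{-\mu^2}$ rests on the elementary fact that $\mu^2/(1 + \mu^2)$ is not integrable.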
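Before proceeding further we find the $\mu$ counterpart of the $\lambda$ Fourier
series

$\gamma(\lambda) = \sum_{j=0}^{\infty} \gamma_j e^{-2\pi i j\lambda}, \quad \sum_{0}^{\infty} |\gamma_j|^2 < \infty.$

Using the relation between $\lambda$ and $\mu$, we find that

$\frac{\gamma(\lambda)}{\sqrt{\pi}(1 + i\mu)} = \sum_{j=0}^{\infty} \gamma_j \frac{(1 - i\mu)^j}{\sqrt{\pi}(1 + i\mu)^{j+1}},$

and since, for properly chosen $A_0, \cdots, A_j$,

(5.5)    $\frac{(1 - i\mu)^j}{\sqrt{\pi}(1 + i\mu)^{j+1}} = \sum_{r=0}^{j} \frac{A_r}{(1 + i\mu)^{r+1}}$
         $\quad = \sum_{r=0}^{j} A_r \int_{-\infty}^{0} \frac{(-2\pi t)^r}{r!} \, 2\pi e^{2\pi t} e^{2\pi i t\mu} \, dt$
         $\quad = \int_{-\infty}^{\infty} e^{2\pi i t\mu} f_j(t) \, dt,$

where

$f_j(t) = 2\pi e^{2\pi t} \sum_{r=0}^{j} A_r \frac{(-2\pi t)^r}{r!}, \quad t < 0,$
$\qquad = 0, \quad t > 0,$

we have finally

$\frac{\gamma(\lambda)}{\sqrt{\pi}(1 + i\mu)} = \sum_{0}^{\infty} \gamma_j \int_{-\infty}^{\infty} e^{2\pi i t\mu} f_j(t) \, dt = \int_{-\infty}^{\infty} e^{2\pi i t\mu} \sum_{0}^{\infty} \gamma_j f_j(t) \, dt.$

Since the $\lambda$ functions $\{e^{-2\pi i j\lambda}\}$ form an orthonormal sequence in $[-\tfrac12, \tfrac12]$
(weighting $d\lambda$), the $\mu$ functions

$\left\{\frac{(1 - i\mu)^j}{\sqrt{\pi}(1 + i\mu)^{j+1}}\right\}$

form an orthonormal sequence in $(-\infty, \infty)$ (weighting $d\mu$). The $f_j$'s are
thus Fourier transforms of the functions of an orthonormal sequence and
hence (Parseval identity) themselves form an orthonormal sequence in
$(-\infty, \infty)$ (weighting $dt$). The series $\sum \gamma_j f_j$ therefore converges in the
mean, and the above operations were all legitimate. The fact that the
Fourier series for $\gamma$ involves no $e^{2\pi i j\lambda}$ with $j > 0$ corresponds to the fact
that $\gamma(\lambda)/(1 + i\mu)$ as a function of $\mu$ is the Fourier transform of a function
vanishing for positive values of the argument.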

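In the discrete parameter case a condition of the form $\sum_0^\infty c_j z^j \neq 0$ for
$|z| < 1$ was important. The $c_j$'s were constants with $\sum_0^\infty |c_j|^2 < \infty$. If
we make the transformation

$z = \frac{1 - iw}{1 + iw},$

this condition becomes

$\sum_{0}^{\infty} c_j \left(\frac{1 - iw}{1 + iw}\right)^j \neq 0, \quad \Im(w) < 0.$

Now define

$c^*(t) = \sum_{0}^{\infty} c_j f_j(t), \quad c(\mu) = \int_{-\infty}^{0} e^{2\pi i t\mu} c^*(t) \, dt.$

The function $f_j$ was defined above so that

$\frac{(1 - iw)^j}{\sqrt{\pi}(1 + iw)^{j+1}} = \int_{-\infty}^{0} e^{2\pi i t w} f_j(t) \, dt$

for $w$ real. This relation then obviously holds for $\Im(w) < 0$ also so that
the condition $\sum_0^\infty c_j z^j \neq 0$, $|z| < 1$, becomes

$\int_{-\infty}^{0} e^{2\pi i t w} c^*(t) \, dt \neq 0, \quad \Im(w) < 0.$

[Since $c^*(t) = 0$ for $t \geq 0$, the integrand is exponentially small when $|t|$
is large.]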

We can now obtain the desired prediction theorems in the continuous
parameter case. It will be convenient to derive them in an order different
from that of the corresponding theorems in §4. Corresponding to
Theorem 4.4 we have:

THEOREM 5.2 Let $G$ be a bounded monotone non-decreasing function,
on $(-\infty, \infty)$, with $G(\mu+) = G(\mu)$, $G(-\infty) = 0$. Let $G_2$ be its singular
component, and let $S$, of Lebesgue measure 0, be the set of increase
of $G_2$.

(i) If $G'(\mu) = 0$ at most on a set of Lebesgue measure 0, and if

(5.6)    $\int_{-\infty}^{\infty} \frac{\log G'(\mu)}{1 + \mu^2} \, d\mu > -\infty,$

there is a Lebesgue measurable function $c^*$ such that

(5.7)    $c^*(t) = 0, \quad t \geq 0, \qquad \int_{-\infty}^{0} |c^*(t)|^2 \, dt < \infty,$
         $\int_{-\infty}^{0} e^{2\pi i t w} c^*(t) \, dt \neq 0, \quad \Im(w) < 0,$
         $\log \left[\int_{-\infty}^{0} c^*(t) e^{2\pi t} \, dt\right] = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{\log G'(\mu)}{1 + \mu^2} \, d\mu$

(where the integral on the left is real and positive), and that, if $c$ is defined by

$c(\mu) = \int_{-\infty}^{0} e^{2\pi i t\mu} c^*(t) \, dt,$

then

(5.8)    $G(\mu) = \int_{-\infty}^{\mu} |c(\mu')|^2 \, d\mu' + G_2(\mu).$

The function $c^*$ is uniquely determined, neglecting sets of Lebesgue
measure 0, by these conditions.

The manifold ${}_{-\infty}\mathfrak{N}$ consists of all functions $\beta$ which are measurable with
respect to $G$, vanish for almost all $\mu$ (Lebesgue measure) and for which

$\int_{-\infty}^{\infty} |\beta(\mu)|^2 \, dG_2(\mu) < \infty.$

The manifold ${}_t\mathfrak{N}$ consists of all functions of the form

$e^{2\pi i t\mu} \dfrac{\int_{-\infty}^{0} \gamma(s) e^{2\pi i s\mu} \, ds}{c(\mu)}, \quad \mu \notin S,$
$\beta(\mu), \quad \mu \in S,$

where $\beta \in {}_{-\infty}\mathfrak{N}$ and $\int_{-\infty}^{0} |\gamma(s)|^2 \, ds < \infty$.

(ii) If the hypothesis of (i) is not satisfied, ${}_t\mathfrak{N} = {}_{-\infty}\mathfrak{N}$ for all $t$ and this
manifold consists of all functions $\beta$ which are measurable with respect to $G$
and for which

$\int_{-\infty}^{\infty} |\beta(\mu)|^2 \, dG(\mu) < \infty.$

(i') If there is a Lebesgue measurable $c^*$, with $\int_{-\infty}^{0} |c^*(t)|^2 \, dt < \infty$ and
$c^*(t) = 0$ for $t > 0$, if (5.8) is true, where $c$ is the Fourier transform of $c^*$,
and if ${}_t\mathfrak{N}$ is as described in (i), then $c^*$ is the uniquely determined function
described in (i), aside from a proportionality constant of modulus 1.

(ii') If ${}_t\mathfrak{N}$ is as described in (ii), (5.6) is false.

The representation of $G$ in (i) is the exact counterpart of the
representation of $F$ in Theorem 4.4. We have already observed that ${}_0\mathfrak{M}$ corresponds
in $\mu$ language to ${}_0\mathfrak{N}$. Then by Theorem 4.4 the description of ${}_t\mathfrak{N}$ in the
regular case is at least correct for $t = 0$. It is correct for all $t$ since ${}_t\mathfrak{N}$
consists of the functions in ${}_0\mathfrak{N}$ multiplied by $e^{2\pi i t\mu}$. The rest of the
theorem is obtained by translating Theorem 4.4 from $\lambda$ language to $\mu$
language. The last condition in (5.7) is equivalent to

$\log |c(-i)| = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{\log |c(\mu)|^2}{1 + \mu^2} \, d\mu = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{\log G'(\mu)}{1 + \mu^2} \, d\mu.$


Corresponding to Theorem 4.2 we have

THEOREM 5.3 Let $\{x(t), -\infty < t < \infty\}$ be a regular process. Then
$x(t)$ can be written in the form

(5.9)    $x(t) = \int_{-\infty}^{0} c^*(s) \, d\xi(t + s) + v(t) = u(t) + v(t),$

where $c^*$ is Lebesgue measurable, $c^*(t) = 0$ for $t > 0$;

$\int_{-\infty}^{0} |c^*(t)|^2 \, dt < \infty;$

the $\xi(t)$ process has orthogonal increments with $E\{|d\xi(t)|^2\} = dt$; every $\xi(t)$
increment is orthogonal to every $v(s)$; $\xi(t_2) - \xi(t_1) \in \mathfrak{N}_{t_2}$ if $t_1 < t_2$;
$v(t) \in \mathfrak{N}_{-\infty}$. Only the functions proportional (with constant of proportionality
of modulus 1) to the function $c^*$ of Theorem 5.2 (i) satisfy these conditions.

In this representation the $u(t)$ process is regular and the $v(t)$ process is
deterministic. The prediction $\psi_{t-\tau,\tau}$ is given by

(5.10)    $\psi_{t-\tau,\tau} = \int_{-\infty}^{-\tau} c^*(s) \, d\xi(t + s) + v(t),$

and the prediction error is

(5.11)    $\sigma_\tau^2 = \int_{-\tau}^{0} |c^*(s)|^2 \, ds,$

which is the same as that for the $u(t)$ process. The spectral distributions of
the $u(t)$ and $v(t)$ processes are respectively the absolutely continuous and
singular components of the spectral distribution of the $x(t)$ process.
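
As a check on the form of the error formula (5.11), the Python sketch below evaluates $\int_{-\tau}^{0} |c^*(s)|^2 \, ds$ by the midpoint rule for one kernel. It is an illustration and not part of the original text: the function name `lag_error` and the kernel $c^*(s) = e^{s}$ for $s < 0$ are assumptions made for the example, with no claim that this kernel carries the normalization of Theorem 5.2. For it the integral is $(1 - e^{-2\tau})/2$, which increases to $\tfrac12 = \int_{-\infty}^{0} |c^*(s)|^2 \, ds$ as $\tau \to \infty$.

```python
import numpy as np

def lag_error(kernel, tau, n_grid=20_000):
    """Midpoint-rule approximation of the integral of |c*(s)|^2 over (-tau, 0)."""
    s = -tau + (np.arange(n_grid) + 0.5) * tau / n_grid   # midpoints of (-tau, 0)
    return tau * np.mean(np.abs(kernel(s)) ** 2)

kernel = lambda s: np.exp(s)   # assumed kernel c*(s) = e^s for s < 0

for tau in (0.1, 1.0, 10.0):
    print(tau, lag_error(kernel, tau), (1.0 - np.exp(-2.0 * tau)) / 2.0)
```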

To derive this representation of $x(t)$ we use the spectral representation
(5.1). Define $G_2$ as the singular component of $G$, let $S$ be the set of
increase of $G_2$, of Lebesgue measure 0, and let $S'$ be the complement of $S$.
Define $u(t)$ and $v(t)$ by

$u(t) = \int_{S'} e^{2\pi i t\mu} \, dy(\mu),$

$v(t) = \int_{S} e^{2\pi i t\mu} \, dy(\mu).$

Then every $u(t)$ is orthogonal to every $v(t')$. The $u(t)$ and $v(t)$ processes
are stationary in the wide sense, with respective spectral distributions the
absolutely continuous and singular components of that of the $x(t)$ process.
Since the $x(t)$ process is regular, Theorems 5.1 and 5.2 imply that there
is a pair of functions $c^*$, $c$ as described in Theorem 5.2. According to
(5.8) the $u(t)$ process has spectral density $|c(\mu)|^2$, and according to X §8
this process can therefore be expressed as a process of moving averages
of the form in (5.9). More specifically, we can write $u(t)$ in the form

$u(t) = \int_{-\infty}^{\infty} e^{2\pi i t\mu} c(\mu) \, d\xi^*(\mu) = \int_{-\infty}^{0} c^*(s) \, d\xi(t + s),$

where the $\xi^*(\mu)$ and $\xi(t)$ processes have orthogonal increments with

$E\{|d\xi^*(\mu)|^2\} = d\mu, \quad E\{|d\xi(t)|^2\} = dt,$

and the $\xi(t)$ process is the Fourier transform of the $\xi^*$ process. Here
$\xi^*(\mu)$ is uniquely determined, neglecting values on a set of probability 0, by

$\xi^*(\mu_2) - \xi^*(\mu_1) = \int_{(\mu_1, \mu_2) \cap S'} \frac{dy(\mu)}{c(\mu)}.$

Then $\xi^*(\mu)$ and $\xi(t)$ increments are orthogonal to every $v(s)$. For
$s_1 < s_2 \leq t$ define $\Psi$ by

(5.12)    $\Psi(\mu) = \frac{1}{c(\mu)} \int_{s_1}^{s_2} e^{2\pi i \mu r} \, dr, \quad \mu \in S',$
          $\qquad\; = 0, \quad \mu \in S.$

Then, by Theorem 5.2, $\Psi \in {}_t\mathfrak{N}$ and therefore the random variable

(5.13)    $w = \int_{-\infty}^{\infty} \Psi(\mu) \, dy(\mu) = \int_{-\infty}^{\infty} \frac{e^{2\pi i \mu s_2} - e^{2\pi i \mu s_1}}{2\pi i \mu} \, d\xi^*(\mu) = \xi(s_2) - \xi(s_1)$

belongs to $\mathfrak{N}_t$. If $\Gamma$ is defined by

$\Gamma(\mu) = 0, \quad \mu \in S',$
$\qquad = e^{2\pi i t\mu}, \quad \mu \in S,$

$\Gamma \in {}_{-\infty}\mathfrak{N}$, and therefore

$\int_{-\infty}^{\infty} \Gamma(\mu) \, dy(\mu) = \int_{S} e^{2\pi i t\mu} \, dy(\mu) = v(t)$

belongs to $\mathfrak{N}_{-\infty}$.




A function $c^*$ satisfying the condition of the theorem must be
proportional to the $c^*$ of Theorem 5.2 because the manifold $\mathfrak{N}_{-\infty}$ will then be the
closed linear manifold generated by the $v(t)$'s, $\mathfrak{N}_t$ will be that generated
by $\mathfrak{N}_{-\infty}$ and by $\xi(s)$ increments with arguments $\leq t$, so that ${}_{-\infty}\mathfrak{N}$ and ${}_t\mathfrak{N}$
will be as described in Theorem 5.2 (i), and (i') now implies the desired
result.

The prediction $\psi_{t-\tau,\tau}$ of $x(t)$ in terms of the past up to time $t - \tau$ is
given by (5.10) because with this definition $\psi_{t-\tau,\tau} \in \mathfrak{N}_{t-\tau}$ and $x(t) - \psi_{t-\tau,\tau}$
is orthogonal to $\mathfrak{N}_{t-\tau}$ because this difference involves only $\xi$ increments
with arguments $> t - \tau$.

We have seen in (5.12) and (5.13) that $w = \xi(s_2) - \xi(s_1)$ corresponds in
$\mu$ language to $\Psi$ as defined by (5.12), in the correspondence (5.2). Then,
since the prediction $\psi_{0,\tau}$ corresponds to the prediction function $\Psi_\tau$, $\Psi_\tau$ is
given by

$\Psi_\tau(\mu) = e^{2\pi i \tau\mu} \dfrac{\int_{-\infty}^{-\tau} e^{2\pi i \mu s} c^*(s) \, ds}{c(\mu)}, \quad \mu \in S',$
$\qquad\quad = e^{2\pi i \tau\mu}, \quad \mu \in S.$

This evaluation can of course also be checked directly from the fact that,
as so defined, $\Psi_\tau \in {}_0\mathfrak{N}$ and $e^{2\pi i \tau\mu} - \Psi_\tau$ is orthogonal to ${}_0\mathfrak{N}$.


6. Generalization of §4 and §5 

We now consider the following generalization of the prediction problem
studied in §4. Let $\{x_n, -\infty < n < \infty\}$ be a stationary (wide sense)
process, and let $X$ be any random variable with $E\{|X|^2\} < \infty$. We wish
to approximate $X$ by linear combinations of $x_j$'s for $j \le n$. More
precisely, we wish to evaluate

$$\varphi_n(X) = \hat{E}\{X \mid x_j,\ j \le n\} = \hat{E}\{X \mid \mathfrak{M}_n\},$$

the random variable in $\mathfrak{M}_n$ closest to $X$. In particular, if $X = x_{n+\nu}$,
$\varphi_n(X) = \varphi_{n,\nu}$ and the problem of finding $\varphi_n(X)$ becomes exactly that
studied in §4. If $x$ is defined by

$$x = \hat{E}\{X \mid x_j,\ j = 0, \pm 1, \cdots\} = \hat{E}\{X \mid \mathfrak{M}_\infty\},$$

then $X - x$ is orthogonal to every $x_j$, so that

$$\varphi_n(X) = \varphi_n(x) = \hat{E}\{x \mid \mathfrak{M}_n\},$$

and in the following therefore we shall usually consider $x$ rather than $X$.
In $\lambda$ language (see §4) the problem can be stated as follows: There is
given a $\lambda$ function $f$, measurable with respect to $F$, the spectral distribution
function of the $x_n$ process, with

$$\int_{-1/2}^{1/2} |f(\lambda)|^2\, dF(\lambda) < \infty.$$

This function corresponds to $x$ in the correspondence (2.2), so that

$$x = \int_{-1/2}^{1/2} f(\lambda)\, dy(\lambda). \tag{6.1}$$

The projection of $f$ on ${}_n\mathfrak{M}$ is to be found, that is, the function in the
closed linear manifold generated by the functions $\{e^{2\pi i j\lambda},\ j \le n\}$ [$\lambda$ weighting
$dF(\lambda)$] closest to $f$. The solution $\Phi_n(f)$ will be called the nth approximation
function and $\varphi_n(X) = \varphi_n(x)$ will be called the nth approximation.
Evidently

$$\varphi_n(X) = \varphi_n(x) = \int_{-1/2}^{1/2} \Phi_n(f, \lambda)\, dy(\lambda). \tag{6.2}$$

In particular, if $X = x_{n+\nu}$, $\Phi_n(f) = e^{2\pi i n\lambda}\, \Phi_\nu$, where $\Phi_\nu$ is the prediction
function for lag $\nu$ discussed in §4. The nth mean square error is given by

$$\sigma_n^2(X) = E\{|X - \varphi_n(X)|^2\} = E\{|X - x|^2\} + E\{|x - \varphi_n(x)|^2\}. \tag{6.3}$$


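If the $x_n$ process is non-regular, $\mathfrak{M}_n = \mathfrak{M}_\infty$ for all $n$. Then in this case
$\varphi_n(x) = x$, $\Phi_n(f) = f$, and the nth mean square error is independent of $n$,

$$\sigma_n^2(X) = E\{|X - x|^2\}.$$

In general it is clear that

$$\sigma_n^2(X) \ge \sigma_{n+1}^2(X),$$

$$\lim_{n \to -\infty} \sigma_n^2(X) = E\{|X - x|^2\} + E\{|x - \hat{E}\{x \mid \mathfrak{M}_{-\infty}\}|^2\},$$

$$\lim_{n \to \infty} \sigma_n^2(X) = E\{|X - x|^2\}, \tag{6.4}$$

$$\operatorname*{l.i.m.}_{n \to -\infty} \varphi_n(X) = \hat{E}\{x \mid \mathfrak{M}_{-\infty}\},$$

$$\operatorname*{l.i.m.}_{n \to \infty} \varphi_n(X) = x.$$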

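In the regular case, if we write the Fourier series for $x$ in terms of the
orthonormal sequence of the $\xi_j$'s of the Wold decomposition theorem,
Theorem 4.2,

$$x = \sum_{j=-\infty}^{\infty} \gamma_j \xi_j + z, \qquad \gamma_j = E\{x \bar{\xi}_j\} = E\{X \bar{\xi}_j\}, \qquad z = \hat{E}\{x \mid \mathfrak{M}_{-\infty}\}, \tag{6.5}$$

we can write $\varphi_n(x)$ in the form

$$\varphi_n(x) = \sum_{j=-\infty}^{n} \gamma_j \xi_j + z. \tag{6.6}$$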



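In fact, as so defined, $\varphi_n(x) \in \mathfrak{M}_n$ and $x - \varphi_n(x)$ is orthogonal to $\mathfrak{M}_n$.
We know from §4 that the $\lambda$ function corresponding to $\xi_n$ is
$e^{2\pi i n\lambda} \big/ \sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}$,
where the $c_j$'s are the constants of Theorems 4.2, 4.3, and 4.4.
It is easily verified that in the regular case we have the following
correspondence between random variables and $\lambda$ functions, under (2.2) (using
the notation of §4):

$$\xi_k \;\leftrightarrow\; \begin{cases} \dfrac{e^{2\pi i k\lambda}}{\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}}, & \lambda \notin S, \\[2ex] 0, & \lambda \in S; \end{cases}$$

$$z \;\leftrightarrow\; \begin{cases} 0, & \lambda \notin S, \\ f(\lambda), & \lambda \in S; \end{cases}$$

$$x \;\leftrightarrow\; \begin{cases} \dfrac{\sum_{j=-\infty}^{\infty} \gamma_j e^{2\pi i j\lambda}}{\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}} = f(\lambda), & \lambda \notin S, \\[2ex] f(\lambda), & \lambda \in S; \end{cases}$$

$$\varphi_n(x) \;\leftrightarrow\; \Phi_n(f, \lambda) = \begin{cases} \dfrac{\sum_{j=-\infty}^{n} \gamma_j e^{2\pi i j\lambda}}{\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}}, & \lambda \notin S, \\[2ex] f(\lambda), & \lambda \in S, \end{cases}$$

where

$$\gamma_k = E\{x \bar{\xi}_k\} = \int_{-1/2}^{1/2} e^{-2\pi i k\lambda} f(\lambda) \sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}\, d\lambda.$$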
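The nth mean square error is

$$\sigma_n^2(X) = E\{|X - x|^2\} + \sum_{j=n+1}^{\infty} |\gamma_j|^2. \tag{6.7}$$

The functions $\sum_{j=-\infty}^{\infty} \gamma_j e^{2\pi i j\lambda}$ and $\sum_{j=-\infty}^{n} \gamma_j e^{2\pi i j\lambda}$ are fundamental to this study.
We observe that, by the above identification of the $\lambda$ function corresponding
to $x$,

$$\sum_{j=-\infty}^{\infty} \gamma_j e^{2\pi i j\lambda} = f(\lambda) \sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda} = \frac{f(\lambda) F'(\lambda)}{\overline{\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}}} \tag{6.8}$$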




for almost all $\lambda$. We have written the formula in the last form because of
its connection with the covariances involved in the prediction problem,

$$\rho_n = E\{X \bar{x}_n\} = E\{x \bar{x}_n\} = \int_{-1/2}^{1/2} e^{-2\pi i n\lambda} f(\lambda)\, dF(\lambda). \tag{6.9}$$

This formula shows that the $\rho_n$'s are the Fourier coefficients corresponding
to a complex-valued function of bounded variation,

$$\int_{-1/2}^{\lambda} f(\mu)\, dF(\mu).$$

Thus the $\rho_n$'s determine $fF'$, $F'$ determines the $c_j$'s, and $fF'$ together with
the $c_j$'s determines the $\gamma_j$'s by (6.8).
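In the continuous parameter case (using the notation of §5) there is a
random variable $X$, and we wish to find

$$\varphi_t(X) = \hat{E}\{X \mid \mathfrak{M}_t\} = \hat{E}\{x \mid \mathfrak{M}_t\},$$

where

$$x = \hat{E}\{X \mid \mathfrak{M}_\infty\}.$$

We use the decomposition theorem, Theorem 5.3. Since $x \in \mathfrak{M}_\infty$ we can
write it on the one hand in the form

$$x = \int_{-\infty}^{\infty} g(\lambda)\, dy(\lambda), \qquad \int_{-\infty}^{\infty} |g(\lambda)|^2\, dG(\lambda) < \infty, \tag{6.1$'$}$$

and on the other hand in the form

$$x = \int_{-\infty}^{\infty} \gamma^*(s)\, d\xi(s) + z', \qquad \int_{-\infty}^{\infty} |\gamma^*(s)|^2\, ds < \infty, \qquad z' = \hat{E}\{x \mid \mathfrak{M}_{-\infty}\}. \tag{6.5$'$}$$

Then the prediction $\varphi_t(x)$ can be written in the form

$$\varphi_t(x) = \int_{-\infty}^{t} \gamma^*(s)\, d\xi(s) + z', \tag{6.6$'$}$$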
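and the prediction function $\Psi_t(g, \cdot)$, the $\lambda$ function corresponding to
$\varphi_t(x)$, is given by

$$\Psi_t(g, \lambda) = \begin{cases} \dfrac{\int_{-\infty}^{t} \gamma^*(s)\, e^{2\pi i s\lambda}\, ds}{c(\lambda)}, & \lambda \notin S, \\[2ex] g(\lambda), & \lambda \in S. \end{cases}$$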




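The mean square error is $\sigma_t^2(X) = E\{|X - x|^2\} + \int_t^{\infty} |\gamma^*(s)|^2\, ds$. If $\rho_t$ is
defined by

$$\rho_t = E\{X\, \overline{x(t)}\} = E\{x\, \overline{x(t)}\} = \int_{-\infty}^{\infty} e^{-2\pi i t\lambda} g(\lambda)\, dG(\lambda),$$

this covariance function determines $gG'$ for almost all $\lambda$ (Lebesgue
measure), $G'$ determines $c$, and [comparing (6.1$'$) and (6.5$'$)] $\gamma^*$ is then
determined by

$$\int_{-\infty}^{\infty} e^{2\pi i s\lambda}\, \gamma^*(s)\, ds = g(\lambda)\, c(\lambda) = \frac{g(\lambda)\, G'(\lambda)}{\overline{c(\lambda)}}. \tag{6.8$'$}$$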


7. Multidimensional prediction 

In this section we shall deal with N-dimensional random variables. We
shall therefore adopt the convention that all random variables $x, y, z, \cdots$
are vector random variables unless specifically identified as scalars, and
constants $a, b, c, \cdots$ are (N by N) matrices. It will be convenient to
think of the random variables as (single-column) matrices. If $M$ is a
matrix, $\bar{M}$ will denote its conjugate transpose, $|M|^2$ will denote $M\bar{M}$, and
$\|M\|^2$ will denote the sum of the elements of $M\bar{M}$ down the main diagonal,
that is, the sum of the squares of the moduli of the elements of $M$. If $M$
is a matrix of scalar random variables, $E\{M\}$ is the matrix of their
expectations.

According to our conventions, if $x$, $y$ are random variables, $xy$ is not
defined unless $N = 1$, but $x\bar{y}$ and $|x|^2$ are defined. If $E\{\|x\|^2\} < \infty$ and
$E\{\|y\|^2\} < \infty$, the distance between $x$ and $y$ is defined as $E^{1/2}\{\|x - y\|^2\}$;
$x$ is said to be orthogonal to $y$ if $E\{x\bar{y}\} = 0$. If $E\{|x|^2\} = I$ (identity
matrix), $x$ is said to be normed. Thus the concept of an orthonormal
sequence of random variables is well defined, and in fact a sequence is
orthonormal if and only if the sequence of scalar components is an
orthonormal sequence. Observe that, if $x_1, \cdots, x_n$ is an orthonormal
sequence,

$$E\Big\{\Big|\sum_j a_j x_j\Big|^2\Big\} = \sum_j |a_j|^2.$$
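The matrix conventions of this section can be checked on small examples. The following sketch is illustrative only: it represents matrices as nested Python lists of complex numbers, the helper names (`conj_transpose`, `matmul`, `abs_sq`, `norm_sq`) are hypothetical, and it verifies that the trace of $M\bar{M}$ is the sum of the squared moduli of the elements of $M$. It also shows that for a single-column $x$ the product $x\bar{x}$ is an N by N matrix.

```python
def conj_transpose(m):
    """Conjugate transpose of a matrix given as a list of rows."""
    rows, cols = len(m), len(m[0])
    return [[m[i][j].conjugate() for i in range(rows)] for j in range(cols)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def abs_sq(m):
    """|M|^2 in the convention of this section: M times its conjugate transpose."""
    return matmul(m, conj_transpose(m))


def norm_sq(m):
    """||M||^2: the sum of the diagonal elements of |M|^2."""
    p = abs_sq(m)
    return sum(p[i][i] for i in range(len(p))).real


M = [[1 + 2j, 0j], [3 + 0j, -1j]]
x = [[1 + 1j], [2 + 0j]]  # one value of a 2-dimensional random variable, as a column
```

Here `abs_sq(x)` is a 2 by 2 matrix while `norm_sq(x)` is the scalar 6.0, which is why $x\bar{y}$ is defined for vectors although $xy$ is not.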


The concepts of linear manifold and closed linear manifold are now 
defined as in IV §2 (note that the coefficients of combination are N by N 
matrices). There is a slight inconvenience in the orthogonalization of a 
sequence of random variables owing to the fact that cx may be the null 
vector even though c is not the null matrix and x is not the null vector. 




A sequence $w_1, w_2, \cdots$ of random variables can be orthogonalized in
the sense that there are linear combinations $y_1, y_2, \cdots$ of the $w_j$'s such
that each $w_j$ is a linear combination of $y_j$'s and conversely, and that either
there are infinitely many $y_j$'s which form an orthonormal sequence or
there are finitely many $y_j$'s, $y_1, \cdots, y_n$, such that the $y_j$'s are mutually
orthogonal, $y_1, \cdots, y_{n-1}$ are normed, and $E\{|y_n|^2\}$ is a matrix with
elements all 0 except for some 1's down the main diagonal. To see this
simply orthogonalize the scalar components of the $w_j$'s as in IV §2 to get
an orthonormal sequence of scalar random variables. If there are finitely
many in this orthonormal sequence, add zeros if necessary to obtain a
multiple of N in the sequence. This sequence is then grouped in sets of
N; each set is a $y_j$.

If $x$ is a random variable with $E\{\|x\|^2\} < \infty$ and if $y_1, y_2, \cdots$ is an
orthonormal sequence, the series $\sum_j a_j y_j$ with $a_j = E\{x\bar{y}_j\}$ is called the
Fourier series of $x$ with respect to the $y_j$'s and the $a_j$'s are called the
Fourier coefficients. As in IV §3, it is proved that $\sum_j a_j y_j$ converges (in
the sense of the distance used here), that $\sum_j \|a_j\|^2 < \infty$, and that

$$E\{\|x\|^2\} \ge \sum_j \|a_j\|^2. \tag{7.1}$$

In the present case this means that $\sum_j |a_j|^2$ converges (that is, the matrix
sum converges element by element) and that the Hermitian symmetric
matrix $E\{|x|^2\} - \sum_j |a_j|^2$ is non-negative definite. Equality in (7.1) is
equivalent to

$$E\{|x|^2\} = \sum_j |a_j|^2, \tag{7.1$'$}$$

and this is true if and only if $x$ is in the closed linear manifold generated
by the $y_j$'s.

The projection $x_1 = \hat{E}\{x \mid \mathfrak{M}\}$ of $x$ on the closed linear manifold $\mathfrak{M}$ is
as in IV §3 the closest random variable to $x$ in $\mathfrak{M}$. Each scalar component
of $x_1$ is the (scalar) projection of the corresponding scalar component of
$x$ on the closed linear manifold generated by the components of the
random variables in $\mathfrak{M}$. The projection $x_1$ is characterized by the fact
that $x_1 \in \mathfrak{M}$ and $x - x_1$ is orthogonal to $\mathfrak{M}$.

A process with orthogonal increments is defined exactly as in the case
$N = 1$. It is proved that, if the random variables $\{y(t),\ a \le t \le b\}$
constitute such a process, there is a matrix function $F$ such that, if $s < t$,
$F(t) - F(s)$ is Hermitian symmetric, non-negative definite, and

$$E\{|y(t) - y(s)|^2\} = F(t) - F(s).$$




It follows that the diagonal elements of $F$ are monotone non-decreasing;
all elements are of bounded variation. In the following, if $M$ is Hermitian
symmetric and non-negative definite, we shall denote the sum of the
diagonal elements of $M$ by $\Sigma(M)$. If $\Phi$ is an N × N matrix function
whose elements are measurable with respect to $\Sigma(F(t))$, with

$$\int_a^b \Sigma\big(\Phi(t)\, dF(t)\, \overline{\Phi(t)}\big) < \infty,$$

the stochastic integral

$$\varphi = \int_a^b \Phi(t)\, dy(t)$$

is defined as in IX §2, and satisfies

$$E\{\varphi_1 \bar{\varphi}_2\} = \int_a^b \Phi_1(t)\, dF(t)\, \overline{\Phi_2(t)},$$

where $\Phi_j(t)$ is the integrand corresponding to $\varphi_j$. The random variables
$\varphi$ obtained in this way are exactly those in the closed linear manifold
generated by the $y(t)$ increments. If $\varphi = 0$ with probability 1, the
corresponding $\Phi$ satisfies

$$E\{|\varphi|^2\} = \int_a^b \Phi(t)\, dF(t)\, \overline{\Phi(t)} = 0.$$

Any two functions $\Phi_1$, $\Phi_2$ differing by a function $\Phi$ satisfying this equation
will have the same stochastic integral.

A process $\{x_n, -\infty < n < \infty\}$ is stationary in the wide sense if
$E\{\|x_n\|^2\} < \infty$ and if the covariance (matrix) function

$$R(n) = E\{x_{m+n}\, \bar{x}_m\}$$

does not depend on $m$. The N component scalar processes are then also
stationary in the wide sense. There is a spectral representation

$$x_n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, I\, dy(\lambda), \qquad E\{|dy(\lambda)|^2\} = dF(\lambda),$$

where the $y(\lambda)$ process has orthogonal increments and $I$ is the identity.
The function $F$ is normalized as usual so that $F(-\tfrac12) = 0$, $F(\lambda+) = F(\lambda)$;
it is called the spectral distribution function of the process, and determines
$R(n)$ by

$$R(n) = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, dF(\lambda).$$




From now on we shall use the notation of §4 for the corresponding
concepts here. For example, $\sigma_1^2$ is the mean square error of prediction
with lag 1,

$$\sigma_1^2 = E\{|x_n - \varphi_{n-1,1}|^2\}.$$

The matrix $\sigma_1^2$ is Hermitian symmetric and non-negative definite. In
the following we shall suppose that $\sigma_1^2$ is a non-singular matrix and
derive the Wold decomposition, Theorem 4.2. This assumption on $\sigma_1^2$
excludes the possibility that some linear combination of components of
$x_n$ is completely predictable on the basis of past components of the $x_j$'s.
The matrix $\sigma_1^2$ is now positive definite, and there is a unique square root
$\sigma_1$ which is Hermitian symmetric and positive definite. Define $\xi_n$ by

$$\xi_n = \sigma_1^{-1}(x_n - \varphi_{n-1,1}).$$

Then

$$x_n - \varphi_{n-1,1} = \sigma_1 \xi_n, \qquad E\{|\xi_n|^2\} = I, \qquad \xi_n \in \mathfrak{M}_n,$$

and the $\xi_n$'s form an orthonormal sequence as in §4. We have the
orthogonal development

$$x_n = u_n + v_n, \qquad u_n = \sum_{j=0}^{\infty} c_j \xi_{n-j}, \qquad c_j = E\{x_n \bar{\xi}_{n-j}\},$$

and all the orthogonality and inclusion relations of §4 remain true. The
spectral distribution function $F_1$ is absolutely continuous with

$$F_1'(\lambda) = \Big|\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}\Big|^2,$$

and

$$F = F_1 + F_2.$$
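The normalization $\xi_n = \sigma_1^{-1}(x_n - \varphi_{n-1,1})$ relies on the Hermitian positive definite square root of $\sigma_1^2$. As a hedged numerical illustration only, the sketch below treats the real 2 by 2 case, where the square root has a closed form that follows from the Cayley-Hamilton relation: $\sqrt{A} = (A + sI)/t$ with $s = \sqrt{\det A}$ and $t = \sqrt{\operatorname{tr} A + 2s}$. The function names and the sample matrix are hypothetical, and the general N by N case would need an eigenvalue routine instead.

```python
import math


def sqrt_spd_2x2(a):
    """Symmetric positive definite square root of a real symmetric PD 2x2 matrix."""
    (p, q), (_, r) = a
    s = math.sqrt(p * r - q * q)          # sqrt(det A)
    t = math.sqrt(p + r + 2.0 * s)        # trace of the square root
    return [[(p + s) / t, q / t], [q / t, (r + s) / t]]


def inverse_2x2(m):
    """Inverse of a non-singular 2x2 matrix."""
    (p, q), (u, r) = m
    d = p * r - q * u
    return [[r / d, -q / d], [-u / d, p / d]]


def matmul(a, b):
    """Plain 2x2 (or general) nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Hypothetical one-step error covariance sigma_1^2.
sigma1_sq = [[5.0, 4.0], [4.0, 5.0]]
sigma1 = sqrt_spd_2x2(sigma1_sq)
sigma1_inv = inverse_2x2(sigma1)

# Covariance of the normalized innovation: sigma_1^{-1} sigma_1^2 sigma_1^{-1}.
whitened = matmul(matmul(sigma1_inv, sigma1_sq), sigma1_inv)
```

With this sample matrix `sigma1` comes out as [[2, 1], [1, 2]], and `whitened` is the identity up to rounding, matching $E\{|\xi_n|^2\} = I$.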


If $\Phi$ is a $\lambda$ function corresponding to $\xi_n$, we have, following §4,

$$\Phi(\lambda) \sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda} = e^{2\pi i n\lambda}$$

for almost all $\lambda$ and

$$\int_{-1/2}^{1/2} \Phi(\lambda)\, dF_2(\lambda)\, \overline{\Phi(\lambda)} = 0.$$

According to the first equation the series is a non-singular matrix for
almost all $\lambda$, and we have

$$\Phi(\lambda) = e^{2\pi i n\lambda} \Big[\sum_{j=0}^{\infty} c_j e^{-2\pi i j\lambda}\Big]^{-1}.$$

Taking the absolutely continuous part in the second equation, we find that

$$\int_{-1/2}^{1/2} \Phi(\lambda)\, F_2'(\lambda)\, \overline{\Phi(\lambda)}\, d\lambda = 0,$$

where $F_2'(\lambda)$ is Hermitian symmetric and non-negative definite. Since the
integrand is symmetric and non-negative definite, this means that
$\Phi(\lambda) F_2'(\lambda) \overline{\Phi(\lambda)} = 0$ for almost all $\lambda$ and, since $\Phi(\lambda)$ is almost never
singular, we finally have that $F_2'(\lambda) = 0$ for almost all $\lambda$. Then $F_2$ is the
singular component of $F$ just as in §4. The prediction functions and
predictions are given by the formulas of §4. Since the analytic conditions
determining the $c_j$'s are not known, we shall not pursue this subject further.


SUPPLEMENT 


The following is an outline of certain aspects of measure theory in the 
form needed in this book. Lengthy but standard and readily available 
proofs are omitted. Throughout the discussion, $\Omega$ is an abstract space
of points $\omega$.


1. Fields of point sets 
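DEFINITION A class $\mathcal{F}$ of $\omega$ sets is called a field if it has the following
properties:

(i) $\Omega \in \mathcal{F}$;
(ii) if $\Lambda \in \mathcal{F}$, then $\Omega - \Lambda \in \mathcal{F}$;
(iii) if $n$ is any natural number, and if $\Lambda_1, \cdots, \Lambda_n \in \mathcal{F}$, then
$\bigcup_{j=1}^{n} \Lambda_j \in \mathcal{F}$, $\bigcap_{j=1}^{n} \Lambda_j \in \mathcal{F}$.

A field is called a Borel field if it has the following additional property:

(iv) if $\Lambda_1, \Lambda_2, \cdots \in \mathcal{F}$, then $\bigcup_{j=1}^{\infty} \Lambda_j \in \mathcal{F}$ and $\bigcap_{j=1}^{\infty} \Lambda_j \in \mathcal{F}$.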
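On a finite space the field properties (i)-(iii) can be checked mechanically, and there (iv) adds nothing because only finitely many sets exist. The sketch below is a minimal illustration under that finiteness assumption; the function name `is_field` and the sample families are hypothetical. Closure under pairwise unions and intersections is tested, which gives closure under all finite unions and intersections by induction.

```python
from itertools import combinations


def is_field(omega, family):
    """Check properties (i)-(iii) for a family of subsets of a finite omega."""
    omega = frozenset(omega)
    fam = {frozenset(s) for s in family}
    if omega not in fam:                                   # property (i)
        return False
    if any(omega - s not in fam for s in fam):             # property (ii)
        return False
    return all((a | b) in fam and (a & b) in fam           # property (iii)
               for a, b in combinations(fam, 2))


good = [set(), {1}, {2, 3}, {1, 2, 3}]
bad = [set(), {1}, {2}, {1, 2, 3}]   # complement of {1} is missing
```

For these samples `is_field({1, 2, 3}, good)` holds and `is_field({1, 2, 3}, bad)` does not.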

THEOREM 1.1 If $\mathcal{F}_0$ is a class of $\omega$ sets, there is a uniquely determined
Borel field $\mathcal{F}$ of $\omega$ sets with the following two properties:

(i) $\mathcal{F}_0 \subset \mathcal{F}$;

(ii) if $\mathcal{F}_1$ is a Borel field of $\omega$ sets, and if $\mathcal{F}_0 \subset \mathcal{F}_1$, then $\mathcal{F} \subset \mathcal{F}_1$.

The class $\mathcal{F}$ is the smallest Borel field of sets which includes all the sets
of $\mathcal{F}_0$. There certainly is a Borel field of sets which includes all the sets of
$\mathcal{F}_0$, for example, the Borel field of all $\omega$ sets. Define $\mathcal{F}$ as the class
of sets each of which is in every Borel field of $\omega$ sets containing the sets of
$\mathcal{F}_0$; that is, $\mathcal{F}$ is the intersection of all the Borel fields containing the sets
of $\mathcal{F}_0$. Then $\mathcal{F}$ is a Borel field and has the two properties stated in the
theorem. The uniqueness is a trivial consequence of these two properties.

The field $\mathcal{F}$ will be called the Borel field generated by $\mathcal{F}_0$, and will be
denoted by $\mathcal{B}(\mathcal{F}_0)$.
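For a finite space the generated Borel field can be computed directly, which may help in reading Theorem 1.1. The sketch below is not the intersection argument of the proof: as an assumption made for illustration, it builds the field from the inside by closing the generators under complements and pairwise unions until nothing new appears (intersections then follow from De Morgan's rules). The name `generated_field` is hypothetical.

```python
def generated_field(omega, generators):
    """Smallest field of subsets of a finite omega containing the generators."""
    omega = frozenset(omega)
    fam = {omega, frozenset()} | {frozenset(g) for g in generators}
    while True:
        new = {omega - s for s in fam}                 # complements
        new |= {a | b for a in fam for b in fam}       # pairwise unions
        if new <= fam:                                 # nothing new: closed
            return fam
        fam |= new


small = generated_field({1, 2, 3}, [{1}])
larger = generated_field({1, 2, 3, 4}, [{1}, {2}])
```

In the second example the atoms are {1}, {2} and {3, 4}, so the generated field has 2^3 = 8 sets.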

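THEOREM 1.2 Let $\mathcal{F}_0$ be a field of $\omega$ sets. Then, if $\mathcal{G}$ is a class of $\omega$
sets with the following properties, $\mathcal{B}(\mathcal{F}_0) \subset \mathcal{G}$.

(i) $\mathcal{F}_0 \subset \mathcal{G}$;
(ii) If $\Lambda_j \in \mathcal{G}$, $j \ge 1$, and if either
$$\Lambda_1 \subset \Lambda_2 \subset \cdots, \qquad \bigcup_{j=1}^{\infty} \Lambda_j = \Lambda$$
or
$$\Lambda_1 \supset \Lambda_2 \supset \cdots, \qquad \bigcap_{j=1}^{\infty} \Lambda_j = \Lambda,$$
then $\Lambda \in \mathcal{G}$.

The proof will be omitted. The property (ii) will be described by
stating that $\mathcal{G}$ includes the limit of any monotone sequence of $\mathcal{G}$ sets.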

If $\Omega$ is n-dimensional space, and if $\mathcal{F}_0$ is the class of open sets, the
sets of $\mathcal{B}(\mathcal{F}_0)$ are called n-dimensional Borel sets. The same class of sets
is obtained if $\mathcal{F}_0$ consists of the closed sets, or the (n-dimensional closed,
or open) intervals. The same class of sets is also obtained if $\mathcal{F}_0$ is the
class of finite unions of right semi-closed intervals, that is, of finite unions
of sets of the form

$$\{\lambda_j < \xi_j \le \mu_j,\ j = 1, \cdots, n\}.$$

Here the $\lambda_j$'s and $\mu_j$'s are finite or infinite, except that, if $\mu_j = \infty$, "$\le \mu_j$"
is replaced by "$< \mu_j$" in the above definition, to reject points with infinite
coordinates. This choice of $\mathcal{F}_0$ has the advantage that $\mathcal{F}_0$ is then a field.

In the following we shall discuss $\omega$ functions of various types. Any
$\omega$ set defined by conditions on functions will be denoted by the defining
conditions written between the braces. Thus, if $x$ and $y$ are $\omega$ functions,
and if $Y$ is a set of numbers, the $\omega$ set

$$\{x(\omega) < 3,\ y(\omega) \in Y\}$$

is the $\omega$ set on which $x(\omega) < 3$ and $y(\omega)$ is a number of the set $Y$.

If $\mathcal{F}$ is a Borel field of $\omega$ sets, and if $x$ is an $\omega$ function, $x$ is called
measurable with respect to $\mathcal{F}$ if it is a real-valued function and if, for every
real number $c$, $\{x(\omega) < c\} \in \mathcal{F}$, or if it is a complex-valued function whose
real and imaginary parts are real measurable functions. It is sufficient
in the above definition if the constant $c$ ranges through a dense set of real
numbers, rather than the whole class of real numbers. We note without
proof the fact that linear combinations of functions measurable with
respect to $\mathcal{F}$ are also measurable with respect to $\mathcal{F}$. If $\{x_n, n \ge 1\}$ is a
sequence of $\omega$ functions measurable with respect to $\mathcal{F}$, the set of points
of convergence of the sequence is an $\mathcal{F}$ set, and, if the sequence converges
everywhere, its limit is a function measurable with respect to $\mathcal{F}$.

If $\mathcal{F}$ is the class of Borel measurable sets in n-dimensional space, the
functions measurable with respect to $\mathcal{F}$ are called Borel measurable or
Baire functions. A function of n complex variables is called a Borel
measurable (or Baire) function if it is a Baire function of the 2n real and
imaginary parts of the variables.

If $A$ is a Borel set in n dimensions, the cylinder set it determines in
n′ > n dimensions (by choosing the first n coordinates to determine points
of $A$, and allowing the remaining coordinates to take arbitrary values) is
a Borel set in n′ dimensions. In fact, the class of n-dimensional Borel sets
for which this assertion is true is obviously a Borel field which includes
the n-dimensional intervals, and is therefore the class of n-dimensional
Borel sets. Corresponding to this fact, a Baire function of n
variables determines a function of n′ > n variables (whose function
values are determined by the first n of the n′ variables) and this function
is a Baire function of n′ variables.

THEOREM 1.3 If $\mathcal{F}_0$ is a class of $\omega$ sets such that the class of finite
unions of disjunct $\mathcal{F}_0$ sets is a field, and if $\mathcal{H}$ is a class of $\omega$ functions with
the following properties, then $\mathcal{H}$ includes all $\omega$ functions measurable with
respect to $\mathcal{B}(\mathcal{F}_0)$.

(i) $\mathcal{H}$ includes every function which takes on the value 1 on an $\mathcal{F}_0$ set
and 0 on its complement.

(ii) $\mathcal{H}$ includes every linear combination of a finite number of its functions.

(iii) If $x_n \in \mathcal{H}$, $n \ge 1$, and if $\lim_{n \to \infty} x_n(\omega) = x(\omega)$ exists and is finite for
all $\omega$, then $x \in \mathcal{H}$.

In particular, if $\mathcal{H}$ is a class of functions of n variables, and if $\mathcal{F}_0$ is the
class of right semi-closed intervals, then $\mathcal{H}$ includes every Baire function of
n variables, if $\mathcal{H}$ has the above properties.

In the real [complex] case, the coefficients of combination in (ii) are
real [complex]. We prove this theorem in the real case. Let $\mathcal{G}$ be the
class of those $\mathcal{B}(\mathcal{F}_0)$ sets $M$ with the property that, if $x(\omega) = 1$ on $M$
and $x(\omega) = 0$ otherwise, then $x \in \mathcal{H}$. Then $\mathcal{G}$ includes the $\mathcal{F}_0$ sets,
according to (i), finite unions of disjunct $\mathcal{F}_0$ sets, according to (ii), and the
limits of monotone sequences of $\mathcal{G}$ sets, according to (iii). Since $\mathcal{B}(\mathcal{F}_0)$
is the Borel field generated by the field of finite unions of disjunct $\mathcal{F}_0$
sets, it follows from Theorem 1.2 that $\mathcal{G}$ includes all $\mathcal{B}(\mathcal{F}_0)$ sets.
Now, if $x$ is an arbitrary $\omega$ function measurable with respect to $\mathcal{B}(\mathcal{F}_0)$,
define $x_{nj}$ and $x_n$ by

$$x_{nj}(\omega) = \begin{cases} 1, & \dfrac{j}{n} \le x(\omega) < \dfrac{j+1}{n}, \\[1.5ex] 0 & \text{otherwise;} \end{cases}$$

$$x_n = \sum_{j=-n^2}^{n^2} \frac{j}{n}\, x_{nj}.$$

Then we have just seen that $x_{nj} \in \mathcal{H}$, and according to property (ii) of the
theorem $x_n \in \mathcal{H}$ also. Finally $x \in \mathcal{H}$ by property (iii), because $x_n \to x$,
as was to be proved.
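The last step of this proof approximates a measurable function by functions taking finitely many values. The following sketch illustrates that step pointwise. It assumes a particular choice of approximating functions (step 1/n, with the index j restricted to the range from -n² to n², a truncation chosen here for illustration), and the function name `simple_approx` is hypothetical. Given the value x(ω), it returns the value of the n-th approximating function at ω.

```python
import math


def simple_approx(value, n):
    """Value of the n-th finitely-valued approximation where the target equals `value`.

    The target value is rounded down to a multiple of 1/n; indices outside
    the range -n*n .. n*n contribute 0, so only finitely many values occur.
    """
    j = math.floor(value * n)      # j/n <= value < (j+1)/n
    if -n * n <= j <= n * n:
        return j / n
    return 0.0
```

Whenever |x(ω)| < n the approximation differs from x(ω) by less than 1/n, so the approximating functions converge to x at every point.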

THEOREM 1.4 If $\mathcal{F}$ is a Borel field of $\omega$ sets, if $x_1, \cdots, x_n$ are $\omega$ functions
measurable with respect to $\mathcal{F}$, and if $F$ is a Baire function of n variables,
then $F(x_1, \cdots, x_n)$ is an $\omega$ function measurable with respect to $\mathcal{F}$.

This is a simple application of Theorem 1.3. In fact, if we define $\mathcal{H}$
as the class of Baire functions of n variables for which the assertion of
the theorem is true (and $\mathcal{F}_0$ as the class of right semi-closed intervals),
it is trivially verified that $\mathcal{H}$ has the properties described in Theorem 1.3,
and therefore includes all Baire functions of n variables.
In the following we shall frequently consider $\omega$ sets of the form

$$\{[x_1(\omega), \cdots, x_n(\omega)] \in A\};$$

the indicated set is the $\omega$ set for which the n-tuple $x_1(\omega), \cdots, x_n(\omega)$
defines a point of the set $A$. If the $x_j$'s are real, $A$ is to be taken as a
point set in n dimensions; if the $x_j$'s are complex, $A$ is to be taken as
a point set in 2n dimensions, with the obvious conventions, that is, as a
set of n-tuples each of whose coordinates is a complex number.

COROLLARY If $\mathcal{F}$ is a Borel field of $\omega$ sets, if $x_1, \cdots, x_n$ are $\omega$ functions
measurable with respect to $\mathcal{F}$, and if $A$ is an n-dimensional Borel set
(2n-dimensional if the $x_j$'s are complex valued), then

$$\{[x_1(\omega), \cdots, x_n(\omega)] \in A\} \in \mathcal{F}.$$

To prove this corollary, define $F$ as the Baire function of n variables
which is 1 on $A$ and 0 otherwise. Then $F(x_1, \cdots, x_n)$ is measurable with
respect to $\mathcal{F}$, according to Theorem 1.4, so that the $\omega$ set

$$\{F[x_1(\omega), \cdots, x_n(\omega)] = 1\} = \{[x_1(\omega), \cdots, x_n(\omega)] \in A\}$$

is in the class $\mathcal{F}$.
Let $\mathcal{F}$ be a Borel field of $\omega$ sets and let $\{x_t, t \in T\}$ be a family of $\omega$
functions measurable with respect to $\mathcal{F}$. Then we shall denote by
$\mathcal{B}(x_t, t \in T)$ the Borel field of $\omega$ sets generated by the class of those of the
form $\{x_t(\omega) \in A\}$ for $t \in T$ and $A$ a right semi-closed interval. Obviously

$$\mathcal{B}(x_t, t \in T) \subset \mathcal{F}.$$

The Borel field $\mathcal{B}(x_t, t \in T)$ is the smallest Borel field of $\omega$ sets with respect
to which the $x_t$'s are all measurable. Then $\mathcal{B}(x_t, t \in T)$ can be substituted
for $\mathcal{F}$ in the preceding corollary, and we find that, if $t_1, \cdots, t_n$ is a finite
$T$ set, and if $A$ is an n-dimensional Borel set (or 2n-dimensional if the $x_t$'s
are complex-valued),

$$\{[x_{t_1}(\omega), \cdots, x_{t_n}(\omega)] \in A\} \in \mathcal{B}(x_t, t \in T).$$

The class of $\omega$ sets of the type on the left forms a field which generates
$\mathcal{B}(x_t, t \in T)$, and the same is true if $A$ is restricted to be a finite union of
right semi-closed intervals. In particular, if $T$ is the set of integers
$1, \cdots, n$, $\mathcal{B}(x_1, \cdots, x_n)$ is the class of sets of the form

$$\{[x_1(\omega), \cdots, x_n(\omega)] \in A\},$$

where $A$ is an n-dimensional Borel set (or 2n-dimensional if the $x_j$'s are
complex-valued).


THEOREM 1.5 Let $\mathcal{F}$ be a Borel field of $\omega$ sets and let $x_1, \cdots, x_n$ be
$\omega$ functions measurable with respect to $\mathcal{F}$. Then an $\omega$ function is measurable
with respect to $\mathcal{B}(x_1, \cdots, x_n)$ if and only if it has the form $F(x_1, \cdots, x_n)$,
where $F$ is a Baire function of n variables.

The "if" half of this theorem is a sharpening of the previous theorem,
according to which $F(x_1, \cdots, x_n)$ is measurable with respect to $\mathcal{F}$ if $F$ is
a Baire function. Since $\mathcal{F}$ can be replaced by $\mathcal{B}(x_1, \cdots, x_n)$ in the
statement of Theorem 1.4, however, that theorem asserts that $F(x_1, \cdots, x_n)$
is measurable with respect to $\mathcal{B}(x_1, \cdots, x_n)$. Conversely, let $y$ be an
$\omega$ function measurable with respect to $\mathcal{B}(x_1, \cdots, x_n)$. Suppose that $y$
is real-valued, and define


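$$\Lambda_{jm} = \Big\{ \frac{j}{m} \le y(\omega) < \frac{j+1}{m} \Big\}, \qquad j = 0, \pm 1, \cdots; \quad m \ge 1.$$

There is a Borel set $A_{jm}'$ such that

$$\Lambda_{jm} = \{[x_1(\omega), \cdots, x_n(\omega)] \in A_{jm}'\}.$$

The $\Lambda_{jm}$'s (for fixed $m$) are disjunct. Hence, if we define $A_{jm}$ by

$$A_{jm} = A_{jm}' - A_{jm}' \cap \bigcup_{k \ne j} A_{km}',$$

the $A_{jm}$'s for fixed $m$ are also disjunct, and

$$\Lambda_{jm} = \{[x_1(\omega), \cdots, x_n(\omega)] \in A_{jm}\}.$$

Define $y_m$ by

$$y_m(\omega) = \frac{j}{m}, \qquad \omega \in \Lambda_{jm}.$$

Then we can write $y_m$ in the form

$$y_m = F_m(x_1, \cdots, x_n),$$

where

$$F_m(\xi_1, \cdots, \xi_n) = \begin{cases} \dfrac{j}{m}, & (\xi_1, \cdots, \xi_n) \in A_{jm}, \\[1.5ex] 0 & \text{otherwise.} \end{cases}$$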


Let $R$ be the set of points in n-dimensional space determined by
$[x_1(\omega), \cdots, x_n(\omega)]$ as $\omega$ varies. Then, since

$$\lim_{m \to \infty} F_m[x_1(\omega), \cdots, x_n(\omega)] = \lim_{m \to \infty} y_m(\omega) = y(\omega)$$

for all $\omega$, it follows that

$$\lim_{m \to \infty} F_m(\xi_1, \cdots, \xi_n)$$

exists on $R$. Now the $F_m$'s are Baire functions. Hence the set of points
where the sequence of function values has a finite limit is a Borel set
$R' \supset R$. Define the Baire function $F$ as this limit on $R'$ and 0 otherwise.
Then

$$F(x_1, \cdots, x_n) = y,$$

so that $y$ is a Baire function of the $x_j$'s, as was to be proved. If $y$ is
complex-valued, the preceding result is applied to its real and imaginary
parts.
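On a finite space Theorem 1.5 reduces to a simple check: a function y is measurable with respect to the field generated by a function x exactly when y is constant on each level set of x, and then y = F(x) for the function F read off from those level sets. The sketch below illustrates only this finite case; the name `factor_through` and the sample functions are hypothetical.

```python
def factor_through(omega, x, y):
    """Return a dict F with y(w) == F[x(w)] for every w in omega, or None.

    None is returned when y takes two different values on some level set
    of x, so that y is not a function of x.
    """
    table = {}
    for w in omega:
        key = x(w)
        if key in table and table[key] != y(w):
            return None
        table[key] = y(w)
    return table


omega = range(-3, 4)
F = factor_through(omega, lambda w: w * w, lambda w: w ** 4 + 1)   # y is a function of x
G = factor_through(omega, lambda w: w * w, lambda w: w)            # y distinguishes w from -w
```

Here `F` maps each value of x to the common value of y on that level set, while `G` is `None` because the level set {x = 1} contains both ω = 1 and ω = -1.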

THEOREM 1.6 Let $\mathcal{F}$ be a Borel field of $\omega$ sets, and let $\{x_t, t \in T\}$ be a
family of $\omega$ functions measurable with respect to $\mathcal{F}$. Let $\mathcal{F}_S = \mathcal{B}(x_t, t \in S)$,
where $S \subset T$. Suppose that $T$ is non-denumerable. Then, if $\Lambda \in \mathcal{F}_T$,
there is a denumerable subset $S$ (depending on $\Lambda$) of $T$, such that $\Lambda \in \mathcal{F}_S$.
If $x$ is an $\omega$ function measurable with respect to $\mathcal{F}_T$, there is a denumerable
subset $S$ (depending on $x$) such that $x$ is measurable with respect to $\mathcal{F}_S$.

Let $\mathcal{G}$ be the class of sets $\Lambda$ with the property stated in the theorem.
Then $\mathcal{G} \subset \mathcal{F}_T$ and we wish to prove that $\mathcal{G} = \mathcal{F}_T$. The class $\mathcal{G}$ includes
every $\omega$ set $\{x_t(\omega) \in A\}$, where $t \in T$ and $A$ is a Borel set. Moreover, it is
easy to verify that $\mathcal{G}$ is a Borel field. Then $\mathcal{G} = \mathcal{F}_T$, because of the
minimal property of $\mathcal{F}_T$, as was to be proved. Let $x$ be any real $\omega$
function measurable with respect to $\mathcal{F}_T$. Then for each rational number
$r$ we have seen that there is a denumerable subset $S_r$ of $T$ such that

$$\{x(\omega) \le r\} \in \mathcal{F}_{S_r}.$$

Let $S = \bigcup_r S_r$. Then $S$ is a denumerable subset of $T$, and $x$ is measurable
with respect to $\mathcal{F}_S$. If $x$ is a complex-valued function measurable with
respect to $\mathcal{F}_T$, there are denumerable subsets $S'$ and $S''$ of $T$ with the
property that the real [imaginary] part of $x$ is measurable with respect to
$\mathcal{F}_{S'}$ [$\mathcal{F}_{S''}$]. Then, if $S = S' \cup S''$, $S$ is denumerable and $x$ is measurable
with respect to $\mathcal{F}_S$.


2. Set functions 


DEFINITION If $\mathcal{F}$ is a field of $\omega$ sets, a real finite numerically valued
function $q$ defined on the sets of $\mathcal{F}$ is called completely additive if, whenever
$\Lambda_1, \Lambda_2, \cdots$ are finitely or denumerably infinitely many disjunct $\mathcal{F}$ sets,
with $\bigcup_j \Lambda_j \in \mathcal{F}$, then

$$q\Big\{\bigcup_j \Lambda_j\Big\} = \sum_j q\{\Lambda_j\}.$$

If there are only finitely many $\Lambda_j$'s, the hypothesis that $\bigcup_j \Lambda_j \in \mathcal{F}$ is
implied by the hypothesis that the $\Lambda_j$'s are $\mathcal{F}$ sets. If there are infinitely
many $\Lambda_j$'s, there is the same implication if the field is a Borel field.


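It follows from this definition of complete additivity that, if $M_1, M_2, \cdots$
are $\mathcal{F}$ sets with either

$$M_1 \subset M_2 \subset \cdots, \qquad \bigcup_j M_j = M \in \mathcal{F}$$

or

$$M_1 \supset M_2 \supset \cdots, \qquad \bigcap_j M_j = M \in \mathcal{F},$$

then

$$\lim_{j \to \infty} q\{M_j\} = q\{M\}.$$

If $\Lambda = \{C\}$ is an $\omega$ set determined by conditions "C", we write $q\{\Lambda\}$ as
$q\{C\}$ rather than as $q\{\{C\}\}$.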
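The limit property for monotone sequences can be seen in a concrete case. The sketch below is an illustration under stated assumptions: the space is the set of positive integers, the set function assigns mass 2^(-k) to the point k (a hypothetical example, not one from the text), and exact rational arithmetic is used so that no rounding intervenes. For the increasing sets M_j = {1, ..., j}, whose union is the whole space of total mass 1, the values q{M_j} increase to 1.

```python
from fractions import Fraction


def q(points):
    """Mass of a finite set of positive integers, with mass 2**(-k) at the point k."""
    return sum(Fraction(1, 2 ** k) for k in points)


def increasing(j):
    """The set M_j = {1, ..., j}; these sets increase to the whole space."""
    return set(range(1, j + 1))


values = [q(increasing(j)) for j in range(1, 11)]
```

Each `values[j - 1]` equals 1 - 2^(-j), so the gap to the total mass 1 halves at every step, and additivity over disjunct sets can be checked directly, for instance `q({1, 3}) + q({2}) == q({1, 2, 3})`.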

THEOREM 2.1 Let $\mathcal{F}_0$ be a field of $\omega$ sets. Then, if $q_1$ and $q_2$ are
completely additive set functions defined on the sets of $\mathcal{B}(\mathcal{F}_0)$, and if

$$q_1\{\Lambda\} \le q_2\{\Lambda\}, \qquad \Lambda \in \mathcal{F}_0,$$

it follows that

$$q_1\{\Lambda\} \le q_2\{\Lambda\}, \qquad \Lambda \in \mathcal{B}(\mathcal{F}_0).$$

The theorem remains true if "$\le$" is replaced by "=" in the hypothesis and
conclusion.

To prove this theorem, we need only remark that the class of $\mathcal{B}(\mathcal{F}_0)$
sets on which $q_1$ and $q_2$ are in the stated relationship includes the $\mathcal{F}_0$ sets,
and includes limits of monotone sequences of its sets. That is, this class
satisfies the hypotheses of Theorem 1.2, and therefore is $\mathcal{B}(\mathcal{F}_0)$, as was
to be proved.

DEFINITION If $\mathcal{F}$ is a field of $\omega$ sets, a function $q$ defined on the sets of $\mathcal{F}$
is called a measure if it is completely additive and non-negative, a
probability measure if it is a measure and if $q\{\Omega\} = 1$.

If $q$ is a measure, the sets of the field on which it is defined are usually
called measurable.

THEOREM 2.2 Let $\mathcal{F}_0$ be a field of $\omega$ sets, and let $q_0$ be a measure defined
on the sets of $\mathcal{F}_0$. Then there is a unique measure $q$, defined on the sets
of $\mathcal{B}(\mathcal{F}_0)$, with

$$q\{\Lambda\} = q_0\{\Lambda\}, \qquad \Lambda \in \mathcal{F}_0.$$

The uniqueness property is implied by Theorem 2.1. The existence
proof will be omitted.

THEOREM 2.3 Let $\mathcal{F}$, $\mathcal{F}'$ be Borel fields of $\omega$ sets, and let $q$ be a measure
defined on the sets of a Borel field $\mathcal{G} \supset \mathcal{F} \cup \mathcal{F}'$. Suppose that if $\Lambda' \in \mathcal{F}'$
there is a $\Lambda \in \mathcal{F}$ with the property that

$$q\{\Lambda'(\Omega - \Lambda) \cup (\Omega - \Lambda')\Lambda\} = 0.$$

Then, if $x'$ is an $\omega$ function measurable with respect to $\mathcal{F}'$, there is an $\omega$
function $x$, measurable with respect to $\mathcal{F}$, such that

$$q\{x(\omega) \ne x'(\omega)\} = 0.$$

Let $\mathcal{H}$ be the class of $\omega$ functions $x'$ measurable with respect to $\mathcal{F}'$ for
which the assertion is true. Then by hypothesis $\mathcal{H}$ includes every $\omega$
function which takes on the value 1 on an $\mathcal{F}'$ set and 0 on its complement.
The class $\mathcal{H}$ obviously includes every linear combination of a finite number
of its functions, and the limits of every convergent sequence of its functions.
Hence $\mathcal{H}$ includes all $\omega$ functions measurable with respect to $\mathcal{F}'$, by
Theorem 1.3, as was to be proved.

Let $q$ be a measure defined on the sets of a Borel field $\mathcal{F}$, and consider
all $\omega$ sets $M$ with the property that there are sets $M_1, M_2 \in \mathcal{F}$, such that

$$M_1 \subset M \subset M_2, \qquad q\{M_2 - M_1\} = 0.$$

The class of sets $M$ forms a Borel field $\mathcal{F}^* \supset \mathcal{F}$, and $\mathcal{F}^* = \mathcal{F}$ if and only
if $\mathcal{F}$ includes every subset of an $\mathcal{F}$ set of measure 0. If $M \in \mathcal{F}$ above,
then the relations between $M$, $M_1$ and $M_2$ imply that

$$q\{M\} = q\{M_1\} = q\{M_2\}.$$

Even if $M \notin \mathcal{F}$, the second and third quantities here are equal. Then, if
$M \in \mathcal{F}^* - \mathcal{F}$, we define $q\{M\}$ by this equation, and with this definition
$q$ becomes a measure defined on the sets of $\mathcal{F}^*$. Note that, if $M \in \mathcal{F}^*$,
and if $q\{M\} = 0$, then the subsets of $M$ are also $\mathcal{F}^*$ sets. Each set of
$\mathcal{F}^* - \mathcal{F}$ differs from some $\mathcal{F}$ set by a subset of an $\mathcal{F}$ set of measure 0.
The operation of extending the domain of the given measure from $\mathcal{F}$ to
$\mathcal{F}^*$ is called completing the measure, and a measure with the property that
$\mathcal{F} = \mathcal{F}^*$, that is, that subsets of sets of measure 0 are measurable (and
have measure 0), is called complete. The operation of completion makes
a measure a complete measure. We shall always use the notation $\mathcal{F}^*$
for the Borel field defined as just described. Note that $\mathcal{F}^*$ depends both
on $\mathcal{F}$ and on the measure $q$.
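The completion can be carried out by brute force on a finite space, which makes the dependence of the completed field on the measure visible. The sketch below is illustrative only: the measure is stored as a dictionary from the sets of the field to their values, the names `powerset` and `complete` are hypothetical, and it uses the fact that for field sets M₁ ⊂ M₂ the measure of M₂ − M₁ is the difference of their measures.

```python
from itertools import chain, combinations


def powerset(omega):
    """All subsets of a finite omega, as frozensets."""
    items = sorted(omega)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(s) for s in subsets]


def complete(omega, measure):
    """Completed measure: every M squeezed between field sets M1, M2 of equal measure."""
    completed = {}
    for m in powerset(omega):
        for m1, q1 in measure.items():
            for m2, q2 in measure.items():
                if m1 <= m <= m2 and q2 == q1:   # q{M2 - M1} = q2 - q1 = 0
                    completed[m] = q1
    return completed


field_sets = [frozenset(), frozenset({1}), frozenset({2, 3}), frozenset({1, 2, 3})]
q_null = dict(zip(field_sets, [0, 1, 0, 1]))   # {2, 3} has measure 0
q_full = dict(zip(field_sets, [0, 1, 1, 2]))   # no non-empty set of measure 0
```

With `q_null` the completion contains all eight subsets of {1, 2, 3}, since every subset of the null set {2, 3} becomes measurable, whereas with `q_full` the completion is the original four-set field.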

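THEOREM 2.4 Let $\mathcal{F}_0$ be a field of $\omega$ sets, and suppose that $\mathcal{F}_1$ is a
Borel field of sets with $\mathcal{F}_1 \supset \mathcal{B}(\mathcal{F}_0)$. Let $q$ be a measure defined on $\mathcal{F}_1$,
with the property that, if $\Lambda_1 \in \mathcal{F}_1$, there is a $\Lambda \in \mathcal{B}(\mathcal{F}_0)$ such that

$$q\{\Lambda_1(\Omega - \Lambda) \cup (\Omega - \Lambda_1)\Lambda\} = 0.$$

Then:
(i) If $\Lambda \in \mathcal{F}_1$, and if $\epsilon > 0$, there is a $\Lambda_\epsilon \in \mathcal{F}_0$ such that

$$q\{\Lambda(\Omega - \Lambda_\epsilon) \cup (\Omega - \Lambda)\Lambda_\epsilon\} < \epsilon.$$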
(ii) If Ae BF o), there is a A’ which is the intersection of denumerably 


many Fo sets, and a A” which is the union of denumerably many Fo sets, 


such that 
ACACA, gf{A"—A}=0. 


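(iii) If $x$ is an $\omega$ function measurable with respect to $\mathcal{F}_1$ and if $\epsilon > 0$,
there is a function $x_\epsilon$, taking on a finite number of values, each on an $\mathcal{F}_0$
set, such that

$$q\{|x(\omega) - x_\epsilon(\omega)| > \epsilon\} < \epsilon$$

and (if $x$ is integrable)

$$\int_\Omega |x - x_\epsilon|\, dq < \epsilon.$$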

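To prove (i) [(ii)], let $\mathcal{G}$ be the class of sets $\Lambda \in \mathcal{F}_1$ [$\Lambda \in \mathcal{B}(\mathcal{F}_0)$] for
which the assertion is true. Then $\mathcal{G} \supset \mathcal{F}_0$, and it is easily shown that
$\mathcal{G}$ includes limits of monotone sequences of its sets. But then $\mathcal{G} \supset \mathcal{B}(\mathcal{F}_0)$,
by Theorem 1.2. In (i) it follows that $\mathcal{G} = \mathcal{F}_1$, because of the stated
relation between $\mathcal{B}(\mathcal{F}_0)$ and $\mathcal{F}_1$ sets. Part (iii) is easily reduced to (i) by
approximating $x$ by a function taking on only finitely many values, each
on an $\mathcal{F}_1$ set to which (i) is applicable.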

Example 2.1 Let $\Omega$ be the real line, and let $\mathcal{F}_0$ be the class (field) of
finite unions of right semi-closed intervals. Then $\mathcal{B}(\mathcal{F}_0)$ is the class of
linear Borel sets. Let $F$ be a bounded monotone non-decreasing function
of $\xi$, continuous on the right, with $\lim_{\xi \to -\infty} F(\xi) = 0$. If $\Lambda$ is a finite union
of right semi-closed intervals,

$$\Lambda = \bigcup_{j=1}^{n} (a_j, b_j], \qquad a_1 < b_1 < \cdots < a_n < b_n,$$

define

$$q\{\Lambda\} = \sum_{j=1}^{n} [F(b_j) - F(a_j)].$$

It can be shown that $q$ is a measure on the sets of $\mathcal{F}_0$. Then by Theorem
2.2 the domain of definition of $q$ can be extended to make $q$ a measure
of Borel sets, and the measure $q$ can then be further extended by
completion. The sets of the class $\mathcal{B}(\mathcal{F}_0)^*$ obtained by this completion are
called Lebesgue-Stieltjes measurable, or measurable with respect to $F$, and
$q$ is called a Lebesgue-Stieltjes measure. We omit the easy generalization
to unbounded functions $F$. The class of Lebesgue-Stieltjes measurable
sets depends on the choice of $F$. For example, if $F$ is constant except for
jumps at finitely or enumerably infinitely many points, every $\omega$ set is
Lebesgue-Stieltjes measurable. In general, however, $\mathcal{B}(\mathcal{F}_0)^*$ does not
contain all $\omega$ sets. If $\delta$ is small but positive, the $\mathcal{F}_0$ set indicated above
has measure near that of the larger open set $\bigcup_j (a_j, b_j + \delta)$ and near that
of the smaller closed set $\bigcup_j [a_j + \delta, b_j]$. Thus $\Lambda_\epsilon$ in Theorem 2.4 (i) can
be taken as a union of open, or closed, intervals, losing of course the
property that it is an $\mathcal{F}_0$ set, and $\Lambda'$, $\Lambda''$ in Theorem 2.4 (ii) can be taken
respectively as closed, and open, if convenient. If $x$ is an $\omega$ function
measurable with respect to $F$, that is, measurable with respect to the field
$\mathcal{B}(\mathcal{F}_0)^*$, its integral over a set $\Lambda$ is usually denoted by

$$\int_\Lambda x(\omega)\, dF(\omega).$$

In particular,

$$q\{\Lambda\} = \int_\Lambda dF(\omega), \qquad \Lambda \in \mathcal{B}(\mathcal{F}_0)^*.$$
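The defining sum for the measure of a finite union of right semi-closed intervals is easy to evaluate numerically. The sketch below assumes the intervals are supplied as disjunct pairs (a, b) standing for (a, b]; the function names and the two sample distribution functions are hypothetical. The second sample shows how a jump of F at a point is counted by an interval that contains the point as its closed right end, but not by one that has it as its open left end.

```python
def ls_measure(F, intervals):
    """Measure of a union of disjunct right semi-closed intervals (a, b]."""
    ordered = sorted(intervals)
    for (_, b_prev), (a_next, _) in zip(ordered, ordered[1:]):
        if a_next < b_prev:
            raise ValueError("intervals must be disjunct")
    return sum(F(b) - F(a) for a, b in ordered)


def F_uniform(t):
    """Continuous F: rises linearly from 0 to 1 on [0, 1]."""
    return min(max(t, 0.0), 1.0)


def F_jump(t):
    """Right-continuous F with a single unit jump at 0."""
    return 1.0 if t >= 0 else 0.0
```

For instance `ls_measure(F_uniform, [(0.25, 0.5), (0.75, 2.0)])` is 0.5, while with `F_jump` the interval (-1, 0] has measure 1 and the interval (0, 1] has measure 0.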


Example 2.2 Let $\Omega$ be $n$-dimensional Cartesian space. The following
is the generalization of the preceding example to $n$ dimensions. Let
$\mathcal{F}_0$ be the field of finite unions of right semi-closed
$n$-dimensional intervals, so that $\mathcal{B}(\mathcal{F}_0)$ is the class
of Borel sets in $n$ dimensions. Let $F$ be a bounded function of $n$
variables, monotone non-decreasing and continuous on the right in each
variable, with the following further properties:

(i) $\lim_{\xi_j \to -\infty} F(\xi_1, \ldots, \xi_n) = 0$, $j = 1, \ldots, n$;

(ii) if $a_j \le b_j$, $j = 1, \ldots, n$, and if $S_k$ is the sum

$$\sum F(b_1, \ldots, b_n)$$

with the convention that in each summand $k$ of the $b_j$'s are replaced by
the corresponding $a_j$'s, and that the sum is taken over all $\binom{n}{k}$
possible replacements, then

$$S_0 - S_1 + \cdots + (-1)^n S_n \ge 0.$$
If $A$ is a right semi-closed interval,

$$A = \{a_j < \xi_j \le b_j,\ j = 1, \ldots, n\},$$

define

$$q\{A\} = S_0 - S_1 + \cdots + (-1)^n S_n,$$

and if $A$ is a finite union of such (disjunct) intervals, define $q\{A\}$ as
the corresponding sum. Then it is shown that $q$ is a measure of
$\mathcal{F}_0$ sets, and can therefore be extended to the Borel sets, and
then completed. The measure obtained in this way is Lebesgue-Stieltjes
measure in $n$ dimensions; the measurable sets and functions are called
measurable with respect to $F$. The remarks on the application of Theorem 2.4
made in the 1-dimensional case remain valid in $n$ dimensions. The integral
of a measurable and integrable $\omega$ function $x$ over a measurable set
$A$ is denoted by

$$\int \cdots \int_A x(\xi_1, \ldots, \xi_n)\, d_{\xi_1, \ldots, \xi_n} F(\xi_1, \ldots, \xi_n),$$

and in particular

$$q\{A\} = \int \cdots \int_A d_{\xi_1, \ldots, \xi_n} F(\xi_1, \ldots, \xi_n).$$


Conversely, if $q$ is a measure defined on the field of Borel sets of
$n$-dimensional Cartesian space, and if $F$ is defined by

$$F(\lambda_1, \ldots, \lambda_n) = q\{-\infty < \xi_j \le \lambda_j,\ j = 1, \ldots, n\},$$

$F$ has the properties described in Example 2.1 if $n = 1$, and in the
present example if $n > 1$. The given measure then agrees with the
Lebesgue-Stieltjes measure on finite unions of right semi-closed intervals,
and therefore on all Borel sets, by Theorem 2.1.

If $F_j$, for each $j = 1, \ldots, n$, has the properties described in
Example 2.1, the function $F$ defined by

$$F(\xi_1, \ldots, \xi_n) = \prod_{j=1}^{n} F_j(\xi_j)$$

satisfies the conditions on $F$ in the present example. In this case we find
that

$$S_0 - S_1 + \cdots + (-1)^n S_n = \prod_{j=1}^{n} [F_j(b_j) - F_j(a_j)],$$

and the integral with respect to $F$ is usually written in the form

$$\int \cdots \int_A x(\xi_1, \ldots, \xi_n)\, dF_1(\xi_1) \cdots dF_n(\xi_n),$$

and evaluated by iterated integration. This $n$-dimensional measure is
sometimes described as the product of the 1-dimensional measures on the
coordinate axes determined by the $F_j$'s.
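
As a check on the product formula, the case $n = 2$ can be written out in
full. The computation below uses only the definition of $S_k$ from condition
(ii) of this example.

```latex
\begin{aligned}
S_0 &= F_1(b_1)\,F_2(b_2), \\
S_1 &= F_1(a_1)\,F_2(b_2) + F_1(b_1)\,F_2(a_2), \\
S_2 &= F_1(a_1)\,F_2(a_2), \\
S_0 - S_1 + S_2
    &= F_1(b_1)\,[F_2(b_2) - F_2(a_2)] - F_1(a_1)\,[F_2(b_2) - F_2(a_2)] \\
    &= [F_1(b_1) - F_1(a_1)]\,[F_2(b_2) - F_2(a_2)].
\end{aligned}
```

Each factor is non-negative because $F_j$ is monotone non-decreasing, so
condition (ii) holds in this case.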

Example 2.3 Let $T$ be any infinite aggregate, and let $\Omega$ be the space
of real-valued functions of $t \in T$. Then a point $\omega$ is a $t$
function $\xi(\cdot)$. The following is the extension of the preceding
examples to infinitely many dimensions. Let $x_s$ be the $\omega$ function
with the value $\xi(s)$ if $\omega$ is the function $\xi(\cdot)$, so that
$x_s(\omega) = \xi(s)$. Let $t_1, \ldots, t_n$ be a finite subset of $T$, and
let $A$ be an $n$-dimensional Borel set. If the $t$ function
$\omega$: $\xi(\cdot)$ has values $\xi(t_1), \ldots, \xi(t_n)$ at
$t_1, \ldots, t_n$, the condition

$$[\xi(t_1), \ldots, \xi(t_n)] \in A$$

defines an $\omega$ set. This is the $\omega$ set

$$\{[x_{t_1}(\omega), \ldots, x_{t_n}(\omega)] \in A\}.$$

Let $\mathcal{F}_0$ be the class of all $\omega$ sets obtained in this way,
for arbitrary $n$, $t_1, \ldots, t_n$, $A$. The class $\mathcal{F}_0$ is a
field, but not a Borel field. Then
$\mathcal{B}(\mathcal{F}_0) = \mathcal{B}(x_t, t \in T)$. Suppose that a set
function $q$ is defined on the sets of $\mathcal{F}_0$, with the following
property: for each finite $t$ set $t_1, \ldots, t_n$,

$$q\{[x_{t_1}(\omega), \ldots, x_{t_n}(\omega)] \in A\}$$

defines a function of the Borel set $A$ which is a measure of $n$-dimensional
Borel sets. This is obviously an additive function of $\mathcal{F}_0$ sets.
One way of looking at this definition of $q$ is that, for each finite $t$ set
$t_1, \ldots, t_n$ there is a function $F_{t_1, \ldots, t_n}$ of $n$
variables, satisfying the conditions on such a function stated in Example
2.2, and

$$q\{[x_{t_1}(\omega), \ldots, x_{t_n}(\omega)] \in A\} = \int \cdots \int_A d_{\xi_1, \ldots, \xi_n} F_{t_1, \ldots, t_n}(\xi_1, \ldots, \xi_n).$$

The functions $\{F_{t_1, \ldots, t_n}\}$ must obey two consistency relations
for this definition of $q$ to be unique: if $\alpha_1, \ldots, \alpha_n$ is a
permutation of the integers $1, \ldots, n$, then

$$F_{t_1, \ldots, t_n}(\xi_1, \ldots, \xi_n) = F_{t_{\alpha_1}, \ldots, t_{\alpha_n}}(\xi_{\alpha_1}, \ldots, \xi_{\alpha_n})$$

and, if $m < n$,

$$F_{t_1, \ldots, t_m}(\xi_1, \ldots, \xi_m) = \lim_{\xi_j \to \infty,\ j = m+1, \ldots, n} F_{t_1, \ldots, t_n}(\xi_1, \ldots, \xi_n).$$

It is shown that $q$ is a measure, and therefore can be extended to be a
measure defined on the sets of $\mathcal{B}(\mathcal{F}_0)$, and then
completed. We remark that $\mathcal{F}_0$ can be replaced throughout this
discussion by the slightly smaller field (which generates the same Borel
field) of sets of the form

$$\{[x_{t_1}(\omega), \ldots, x_{t_n}(\omega)] \in A\},$$

where $A$ is a finite union of $n$-dimensional right semi-closed intervals.
We forbear extending the further remarks made in the $n$-dimensional case
to the present case.
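
The two consistency relations can be checked numerically for a simple
family. The sketch below is an illustration under assumptions made here, not
in the text: the coordinates are taken independent and normal, with the
standard deviation of $x_t$ set to $1 + t$, and the names `F`, `normal_cdf`,
and `permuted` are hypothetical helpers.

```python
import math

def normal_cdf(x, sigma):
    """Distribution function of a centered normal law with deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def F(ts, xis):
    """Hypothetical F_{t_1,...,t_n}(xi_1,...,xi_n) for independent coordinates."""
    return math.prod(normal_cdf(x, 1.0 + t) for t, x in zip(ts, xis))

def permuted(seq, alpha):
    """Reorder seq according to the permutation alpha (0-based indices)."""
    return [seq[i] for i in alpha]

ts = [0.5, 2.0, 3.0]
xis = [0.1, -0.4, 1.2]
alpha = [2, 0, 1]

# First relation: permuting the t's and the xi's together changes nothing.
same = F(ts, xis), F(permuted(ts, alpha), permuted(xis, alpha))

# Second relation: letting the last argument tend to infinity
# recovers the lower-dimensional function.
marginal = F(ts[:2], xis[:2]), F(ts, xis[:2] + [1e6])
```

A product family of this kind satisfies both relations automatically; for
dependent coordinates the relations are genuine restrictions.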

Leaving these examples, we make a few remarks on integration, and its 
relation to set functions. If $\mathcal{F}$ is a Borel field of $\omega$
sets, if $q$ is a measure of $\mathcal{F}$ sets, and if $x$ is an $\omega$
function measurable with respect to $\mathcal{F}$, we have supposed above
that the integrability of $x$ over a measurable set is defined in the usual
way, and that the standard properties of the integral are known. We recall
that, if $x$ is a measurable $\omega$ function, it is integrable if and only
if $|x|$ is. The integral of $x$ over $A$ is denoted by

$$\int_A x\, dq \quad \text{or} \quad \int_A x(\omega)\, dq.$$

If $x$ is measurable and integrable over $\Omega$, the set function $f$
defined by

$$f(A) = \int_A x\, dq$$

is completely additive. Moreover, if $q\{|x(\omega)| > 0\} > 0$, $f(A)$
cannot vanish identically. That is, if $f(A) \equiv 0$, it follows that
$x(\omega) = 0$ for almost all $\omega$. Then two set functions $f$ and $g$
of this type are the same if
and only if the corresponding integrands are equal for almost all $\omega$.
A completely additive function of $\mathcal{F}$ sets which vanishes whenever
$q$ vanishes is called absolutely continuous relative to $q$. The set
function $f$ is obviously absolutely continuous relative to $q$. Conversely
(Radon-Nikodym theorem), if $f$ is an absolutely continuous function of
$\mathcal{F}$ sets, it can be expressed in the above integral form, with some
$x$ which is measurable relative to $\mathcal{F}$, and integrable. We have
already remarked that $x$ is then uniquely determined by the set function,
disregarding values on sets of $q$ measure 0. Theorem 2.5 will make this
essentially unique determination of $x$ explicit.

A completely additive function $f$ of $\mathcal{F}$ sets is called singular
relative to $q$ if it does not vanish identically, and if there is an
$\mathcal{F}$ set $M$ such that

$$q(M) = 0,$$
$$f(A) = 0, \qquad A \subset \Omega - M.$$

The set $M$ is called the singular set of $f$ relative to $q$, or, when $f$
is non-negative, the set of increase of $f$. The set $M$ is not uniquely
determined, since every set of $q$ measure 0 including a singular set of $f$
relative to $q$ is itself a singular set of $f$ relative to $q$. According to
a standard theorem of the theory, if $f$ is completely additive, it can be
written in the form $f = f_1 + f_2$, where $f_1$ is absolutely continuous
relative to $q$, and $f_2$ is either singular relative to $q$ or vanishes
identically. This decomposition of $f$ is unique, and, if $f$ is
non-negative, $f_1$ and $f_2$ are also non-negative. In the non-negative
case, the condition for a singular set $M$ of $f$ relative to $q$ becomes

$$q(M) = 0,$$
$$f_2(\Omega - M) = 0.$$
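
On a finite space the decomposition $f = f_1 + f_2$ can be computed point by
point. The following sketch assumes that both $f$ and $q$ are given by
non-negative point masses; the function name `decompose` and the sample
masses are hypothetical and serve only to illustrate the definitions above.

```python
def decompose(f, q):
    """Split point masses f into an absolutely continuous part f1 and a
    singular part f2 relative to point masses q, on a finite space.
    Returns (density x, f1, f2, singular set M)."""
    M = {w for w in q if q[w] == 0 and f.get(w, 0.0) != 0}
    x = {w: (f.get(w, 0.0) / q[w] if q[w] != 0 else 0.0) for w in q}
    f1 = {w: x[w] * q[w] for w in q}
    f2 = {w: (f.get(w, 0.0) if w in M else 0.0) for w in q}
    return x, f1, f2, M

q = {"a": 0.5, "b": 0.5, "c": 0.0}   # hypothetical measure q
f = {"a": 0.2, "b": 0.3, "c": 0.5}   # hypothetical set function f

x, f1, f2, M = decompose(f, q)
```

Here the singular set is $M = \{c\}$: it has $q$ measure 0 and carries all of
$f_2$, while $f_1$ has density $x$ relative to $q$.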


Thus every completely additive $f$ can be written in the form

$$f(A) = \int_A x\, dq + f_2(A),$$

where $f_2$ is the singular component, and $x$ is determined uniquely,
neglecting values on sets of $q$ measure 0. It will be useful below to have
an explicit representation of $x$ in terms of $f$. The following discussion
leads to $x$ as a generalized derivative of $f$. For each $n$ let
$M_1^{(n)}, M_2^{(n)}, \cdots$ be finitely or denumerably infinitely many
disjunct $\mathcal{F}$ sets, with union $\Omega$. Suppose that each
$M_j^{(n+1)}$ is included in some $M_k^{(n)}$. Define

$$x_n(\omega) = \frac{f(M_j^{(n)})}{q(M_j^{(n)})}, \qquad \omega \in M_j^{(n)},\ q(M_j^{(n)}) \ne 0,$$
$$x_n(\omega) = 0, \qquad \omega \in M_j^{(n)},\ q(M_j^{(n)}) = 0.$$

Then $x_n$ is a generalized difference quotient of $f$ relative to $q$. If
$\lim_{n \to \infty} x_n = x_\infty$ exists and is finite for almost all
$\omega$ ($q$ measure), $x_\infty$ is called the derivative of $f$ with
respect to $q$ relative to the net of the $M_j^{(n)}$'s. The derivative will,
except in trivial cases, depend on the net. The following theorem is stated
for future reference.

THEOREM 2.5 If $f$ is a completely additive function of $\mathcal{F}$ sets,
and if $q$ is a measure of $\mathcal{F}$ sets, then there is always a
derivative $x_\infty$ with respect to $q$, relative to a given net. The
derivative is the density $x$ of the absolutely continuous component of $f$
relative to $q$ if and only if the net satisfies the following two
conditions, in which $\mathcal{F}_\infty$ is the Borel field generated by the
class of $M_j^{(n)}$'s, $j, n \ge 1$.

(i) $x$ is equal almost everywhere to a function which is measurable relative
to $\mathcal{F}_\infty$.

(ii) The singular set of $f$ is contained in some $\mathcal{F}_\infty$ set of
measure 0.

In most applications the net is chosen so that
$\mathcal{F}_\infty = \mathcal{F}$, or at least that every $\mathcal{F}$ set
is contained in some $\mathcal{F}_\infty$ set of the same measure. In this
case the net satisfies the above conditions for every $f$. For example, the
net has this property if $\Omega$ is a finite interval, if $\mathcal{F}$ is
the class of its Lebesgue measurable subsets, if $q$ measure is Lebesgue
measure, and if the $M_j^{(n)}$'s are intervals (finitely many for each $n$)
such that the maximum length of $M_1^{(n)}, M_2^{(n)}, \cdots$ goes to 0 when
$n \to \infty$. Theorem 2.5 is proved by probability methods in VII §8 under
the irrelevant assumption that $q$ is a probability measure.
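
The difference quotients $x_n$ are easy to compute in the interval example
just described. The sketch below takes $\Omega = (0, 1]$, $q$ Lebesgue
measure, and the dyadic intervals as the net; the set function
$f\{(a, b]\} = b^2 - a^2$, with density $2\omega$, is an assumption chosen
here for illustration.

```python
import math

def f(a, b):
    """Hypothetical set function on intervals (a, b]: the integral of the
    density x(w) = 2w, so that f((a, b]) = b^2 - a^2."""
    return b * b - a * a

def q(a, b):
    """Lebesgue measure of (a, b]."""
    return b - a

def x_n(w, n):
    """Difference quotient f(M)/q(M) on the dyadic interval
    M = ((j-1)/2^n, j/2^n] of the n-th partition containing w, 0 < w <= 1."""
    j = math.ceil(w * 2 ** n)
    a, b = (j - 1) / 2 ** n, j / 2 ** n
    return f(a, b) / q(a, b) if q(a, b) != 0 else 0.0

coarse = x_n(0.3, 2)    # quotient on (0.25, 0.5]
fine = x_n(0.3, 20)     # quotient on an interval of length 2^-20
```

As $n$ grows, the quotients at $\omega = 0.3$ approach the density value
$2\omega = 0.6$, as the theorem asserts for a net of this kind.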

In Chapter V measures in abstract product spaces are used, and the 
following pages give the necessary background. They can be skipped by 
readers willing to accept the space X in Chapter V as the real line, or at 
least as finite dimensional Cartesian space. 

Let $X$ be any abstract space of points $\xi$, and let $T$ be any aggregate.
Let $\Omega$ be the space of functions $\omega$ of $t \in T$, with values in
$X$. Then a point $\omega$ is a $t$ function $\xi(\cdot)$. Let $x_s$ be the
$\omega$ function with the value $\xi(s)$ if $\omega$ is the function
$\xi(\cdot)$, so that $x_s(\omega) = \xi(s)$. Throughout the following
discussion it will be supposed that a Borel field $\mathcal{F}_X$ of $X$ sets
is given. We shall denote by $\mathcal{F}$ the Borel field of $\omega$ sets
generated by the class of $\omega$ sets of the form $\{x_t(\omega) \in A\}$,
where $t \in T$, $A \in \mathcal{F}_X$.

THEOREM 2.6 If $\Lambda \in \mathcal{F}$, there is a sequence $\{B_n\}$ of
$\mathcal{F}_X$ sets such that, if $\mathcal{F}_X'$ is the Borel field of
$\mathcal{F}_X$ sets generated by $\{B_n\}$, and if $\mathcal{F}'$ is the
Borel field of $\mathcal{F}$ sets generated by the class of sets of the form
$\{x_t(\omega) \in A\}$, where $t \in T$, $A \in \mathcal{F}_X'$, then
$\Lambda \in \mathcal{F}'$.

Let $\mathcal{G}$ be the class of $\mathcal{F}$ sets $\Lambda$ with the
property stated in the theorem. Then $\mathcal{G}$ is obviously a Borel
field, $\mathcal{G} \subset \mathcal{F}$, and, if $A \in \mathcal{F}_X$, then
$\{x_t(\omega) \in A\} \in \mathcal{G}$. Hence $\mathcal{G} = \mathcal{F}$ by
definition of $\mathcal{F}$.

Example 2.4 Let $T$ be the class of integers $1, \ldots, n$. Then a point
$\omega$ is an $n$-tuple $[\xi(1), \ldots, \xi(n)]$, and we shall denote
$\mathcal{F}$ by $\mathcal{F}_X \times \cdots \times \mathcal{F}_X$
($n$ factors). It will be convenient to call the sets of this Borel field
generalized $n$-dimensional Borel sets, and to call numerically valued
functions measurable with respect to this Borel field generalized Baire
functions of $n$ variables. In particular, the generalized linear Borel sets
are the $\mathcal{F}_X$ sets. If $X$ is the real line, and if $\mathcal{F}_X$
is the class of linear Borel sets, the generalized concepts just defined
reduce to the usual ones. For any $T$, we shall denote by $\mathcal{F}_0$ the
field of $\omega$ sets of the form

$$\{[x_{t_1}(\omega), \ldots, x_{t_n}(\omega)] \in A\},$$

where $n$ is an arbitrary positive integer, $t_1, \ldots, t_n$ are arbitrary
points of $T$, and $A$ is an arbitrary generalized $n$-dimensional Borel set.
Then $\mathcal{F}_0 \subset \mathcal{F}$, $\mathcal{F}$ is the Borel field
generated by $\mathcal{F}_0$, and $\mathcal{F} = \mathcal{F}_0$ if $T$ is
finite. We shall discuss measures of $\mathcal{F}$ sets in terms of measures
of $\mathcal{F}_0$ sets.

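Example 2.5 Let $T$ be the class of integers $1, \ldots, n$. We shall set
up a measure of $\mathcal{F} = \mathcal{F}_0$ sets of great importance in the
theory of probability. Let $p_0$ be a measure of generalized linear Borel
sets. For each $j < n$, let $p_j$ be a function of $\xi_1, \ldots, \xi_j, A$,
where $\xi_i \in X$ and $A$ is a generalized linear Borel set, with the
following properties.

(i) For fixed $\xi_1, \ldots, \xi_j$, $p_j(\xi_1, \ldots, \xi_j; \cdot)$ is a
measure of generalized linear Borel sets.

(ii) For fixed $A$, $p_j(\cdot, \ldots, \cdot; A)$ is a generalized Baire
function of $\xi_1, \ldots, \xi_j$.

Then we define a measure of $\mathcal{F}$ sets by the following iterated
integral (evaluated from right to left), in which $\Lambda$ is the $\omega$
set $\{[x_1(\omega), \ldots, x_n(\omega)] \in A\}$ and $A$ is a generalized
$n$-dimensional Borel set:

$$\int_X p_0(d\xi_1) \int_X p_1(\xi_1; d\xi_2) \cdots \int_{[\xi_1, \ldots, \xi_n] \in A} p_{n-1}(\xi_1, \ldots, \xi_{n-1}; d\xi_n) = q(\Lambda).$$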
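
When $X$ is finite the iterated integral becomes an iterated sum, which can
be evaluated from right to left by recursion. The sketch below assumes a
two-point space and a kernel that depends only on the last coordinate; the
names `X`, `p0`, `p`, and `q` and the numerical values are chosen here for
illustration and are not part of the text.

```python
from itertools import product

X = (0, 1)                    # hypothetical two-point space
p0 = {0: 0.5, 1: 0.5}         # initial measure p_0

def p(history, nxt):
    """Hypothetical kernel p_j(xi_1, ..., xi_j; {nxt}): repeat the last
    coordinate with mass 0.9, switch with mass 0.1."""
    return 0.9 if nxt == history[-1] else 0.1

def q(A, n, history=()):
    """Measure of the set A of n-tuples. The innermost sum (over xi_n) is
    evaluated first, matching the right-to-left iterated integral."""
    if len(history) == n:
        return 1.0 if history in A else 0.0
    if not history:
        weights = p0
    else:
        weights = {nxt: p(history, nxt) for nxt in X}
    return sum(weights[nxt] * q(A, n, history + (nxt,)) for nxt in X)

all_paths = set(product(X, repeat=3))
total = q(all_paths, 3)
constant_zero = q({(0, 0, 0)}, 3)
```

In probability language this is the measure of a set of paths of a chain
with initial distribution `p0` and transition probabilities `p`.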


Example 2.6 This example generalizes the preceding one to infinitely many
dimensions. Let $T$ be the class of positive integers, and suppose that
$p_0, p_1, \cdots$ have the properties described in the preceding example,
except that each measure involved is to be a probability measure. Then the
definition of $q(\Lambda)$ given above now defines $q$ as an additive
function of $\mathcal{F}_0$ sets, completely additive if the $\mathcal{F}_0$
sets are restricted to those defined, as above, by conditions on
$x_1, \ldots, x_n$ for fixed $n$. The fact that the measures involved are
probability measures implies that, if $\Lambda \in \mathcal{F}_0$,
$q(\Lambda)$ is uniquely defined, that is, $q(\Lambda)$ is independent of the
value of $n$ used in the integration. Now, if $X$ is the class of real
numbers, and if $\mathcal{F}_X$ is the class of linear Borel sets, we have
already discussed in Example 2.3 Kolmogorov's theorem, that every
non-negative additive function of $\mathcal{F}_0$ sets, completely additive
on each subclass of $\mathcal{F}_0$ determined by conditions on a fixed
finite number of coordinate functions, is
actually completely additive on $\mathcal{F}_0$, and therefore can be
extended to be a measure of $\mathcal{F}$ sets. This theorem is not
necessarily true for general $X$, $\mathcal{F}_X$, but it is true when $q$ is
defined in terms of functions $p_0, p_1, \cdots$ as described above. In
probability theory, $p_0$ defines an initial probability distribution,
$p_1, p_2, \cdots$ define the successive transition probability
distributions, and these functions are frequently the given functions in
terms of which the $\mathcal{F}$ probability measures are to be determined.
The italicized statement thus asserts that such a determination is possible.
To prove the statement it is sufficient to prove that, if

$$\Lambda_1 \supset \Lambda_2 \supset \cdots, \qquad \bigcap_n \Lambda_n = 0, \qquad \Lambda_n \in \mathcal{F}_0,$$

then

$$\lim_{n \to \infty} q(\Lambda_n) = 0.$$

In fact then, if $M_1, M_2, \cdots$ are disjunct $\mathcal{F}_0$ sets with
union $M \in \mathcal{F}_0$, and if $\Lambda_n$ is defined by

$$\Lambda_n = M - \bigcup_{j=1}^{n} M_j,$$

we have

$$\Lambda_1 \supset \Lambda_2 \supset \cdots, \qquad \bigcap_n \Lambda_n = 0, \qquad \Lambda_n \in \mathcal{F}_0,$$

$$q(\Lambda_n) = q(M) - \sum_{j=1}^{n} q(M_j),$$

so that the limit equation $q(\Lambda_n) \to 0$ implies the desired complete
additivity. We can suppose that $\Lambda_n$ has the form

$$\Lambda_n = \{[x_1(\omega), \ldots, x_{a_n}(\omega)] \in A_n\},$$

where $A_n$ is a generalized $a_n$-dimensional Borel set. Then

$$q(\Lambda_n) = \int_X p_0(d\xi_1) \cdots \int_{[\xi_1, \ldots, \xi_{a_n}] \in A_n} p_{a_n - 1}(\xi_1, \ldots, \xi_{a_n - 1}; d\xi_{a_n}) = \int_X \Phi_n(\xi_1)\, p_0(d\xi_1),$$

where $\Phi_n$ has the obvious definition. Here

$$\Phi_1 \ge \Phi_2 \ge \cdots,$$

so that the sequence $\{\Phi_n\}$ is convergent to some limit $\Phi$, and we
have

$$\lim_{n \to \infty} q(\Lambda_n) = \int_X \Phi(\xi_1)\, p_0(d\xi_1).$$

Now let $\delta$ be the limit on the left. We must prove that $\delta = 0$.
If $\delta > 0$, it follows that there is an $\eta_1$, with
$\Phi(\eta_1) > 0$. Now let $\Omega^{(1)}$ be the class
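of sequences $\omega^{(1)}$: $(\xi_2, \xi_3, \cdots)$, $\xi_j \in X$, and
define a function of $\Omega^{(1)}$ sets by

$$q^{(1)}(\Lambda^{(1)}) = \int_X p_1(\eta_1; d\xi_2) \cdots \int_{[\xi_2, \ldots, \xi_n] \in A^{(1)}} p_{n-1}(\eta_1, \xi_2, \ldots, \xi_{n-1}; d\xi_n),$$

where

$$\Lambda^{(1)} = \{[x_2^{(1)}(\omega^{(1)}), \ldots, x_n^{(1)}(\omega^{(1)})] \in A^{(1)}\},$$

$x_j^{(1)}(\omega^{(1)})$ is the obvious coordinate function, and $A^{(1)}$
is a generalized $(n - 1)$-dimensional Borel set. Let $\Lambda_n^{(1)}$ be
the $\omega^{(1)}$ set determined by the conditions

$$(\eta_1, \xi_2, \xi_3, \cdots) \in \Lambda_n.$$

Then

$$\Lambda_1^{(1)} \supset \Lambda_2^{(1)} \supset \cdots, \qquad \bigcap_n \Lambda_n^{(1)} = 0, \qquad q^{(1)}(\Lambda_n^{(1)}) \ge \Phi(\eta_1) > 0.$$

We are thus in a position to repeat the argument already given. In this
way we find successively points $\eta_1, \eta_2, \cdots$ such that for every
pair $m, n$ there are values of $\xi_{m+1}, \xi_{m+2}, \cdots$ such that

$$(\eta_1, \ldots, \eta_m, \xi_{m+1}, \xi_{m+2}, \cdots) \in \Lambda_n.$$

Now consider the point $\omega_0$: $(\eta_1, \eta_2, \cdots)$. For each $n$,

$$(\eta_1, \ldots, \eta_{a_n}, \xi_{a_n + 1}, \xi_{a_n + 2}, \cdots) \in \Lambda_n = \{[x_1(\omega), \ldots, x_{a_n}(\omega)] \in A_n\}$$

for properly chosen $\xi_{a_n + 1}, \xi_{a_n + 2}, \cdots$. That is,

$$(\eta_1, \ldots, \eta_{a_n}) \in A_n.$$

Then $\omega_0 \in \Lambda_n$ and therefore
$\omega_0 \in \bigcap_n \Lambda_n$. This contradicts the hypothesis that the
$\Lambda_n$'s have a null intersection. Hence $\delta = 0$, as was to be
proved.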
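
The remark at the beginning of this example, that $q(\Lambda)$ does not
depend on the value of $n$ used in the integration, comes down to a
one-line computation. Suppose that $\Lambda$ is defined by a condition on the
first $n$ coordinates and is regarded instead as defined by a condition on
the first $n + 1$ coordinates, the last one being unrestricted. Then the
innermost integral is

```latex
\int_X p_n(\xi_1, \ldots, \xi_n; d\xi_{n+1})
  = p_n(\xi_1, \ldots, \xi_n; X) = 1,
```

because each $p_n(\xi_1, \ldots, \xi_n; \cdot)$ is a probability measure, so
the $(n + 1)$-fold iterated integral reduces to the $n$-fold one. Without the
normalization the two evaluations would differ by the factor
$p_n(\xi_1, \ldots, \xi_n; X)$ under the integral sign.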

Before applying Example 2.6 we remark that, although $\Omega$ was defined
as an infinite product space all of whose factor spaces were the same, 
the factor spaces could have been taken to be different without altering 
the argument. 

The first application is to independent measures. Suppose that
$p_j(\xi_1, \ldots, \xi_j; A) = p_j(A)$ does not depend on
$\xi_1, \ldots, \xi_j$. We have then set up the usual product measure in
infinite dimensional space. The hypotheses have imposed no restriction on the
$p_j$ measures.

As a second application, let $X$ be the real line, and let $\mathcal{F}_X$
be the class of linear Borel sets. Then, if $q$ is any additive function of
$\mathcal{F}_0$ sets, which, for each $n$, is a probability measure on the
class of $\mathcal{F}_0$ sets of the form

$$\{[x_1(\omega), \ldots, x_n(\omega)] \in A\},$$

where $A$ is an $n$-dimensional Borel set, $q$ is a measure and can
therefore be extended to $\mathcal{F}$ sets. This is actually a special case
of the
result we have just proved, because, according to I §9, and using
probability language, there is a conditional distribution of $x_n$ relative
to $x_1, \ldots, x_{n-1}$ because the range of values of $x_n$ is a Borel
set, namely the line. That is, in the language we are using here, the
$\mathcal{F}_0$ set function $q$ is determined by functions
$p_0, p_1, \cdots$ as described above. Of course this proof, as well as
Kolmogorov's original one, is applicable more generally if $X$ is a Borel set
in $n$-dimensional space and if $\mathcal{F}_X$ is the class of Borel subsets
of $X$. The obvious generalization to more general topological spaces will be
omitted.

Finally we remark that we have assumed in this discussion that T is 
denumerable. This is not so strong a restriction as might appear at first. 
In fact, suppose that $q$ is an additive function of $\mathcal{F}_0$ sets
and that we are trying to prove that $q$ is a measure by proving that if

$$\Lambda_1 \supset \Lambda_2 \supset \cdots, \qquad \bigcap_n \Lambda_n = 0, \qquad \Lambda_n \in \mathcal{F}_0,$$

then

$$\lim_{n \to \infty} q(\Lambda_n) = 0.$$

Since each $\Lambda_n$ is defined by conditions on finitely many $x_t$'s, at
most denumerably many parameter values are involved in all the
$\Lambda_n$'s and only these need be considered in the discussion. Thus the
general case can be reduced to the case of denumerable $T$. Whether this
reduction is useful or not depends on the hypotheses under examination.

Example 2.7 We now consider Example 2.5 in more detail, choosing $n = 2$ to
avoid unessential complexities. We write $p(\xi; A)$ rather than
$p_1(\xi; A)$. Suppose that a measure $p$ of $\mathcal{F}_X$ sets is given.
Then for each $\xi_1 \in X$ we can write the measure $p(\xi_1; \cdot)$ as the
sum of its absolutely continuous and singular components relative to $p$,
obtaining

$$p(\xi_1; A) = \int_A \gamma(\xi_1, \xi_2)\, p(d\xi_2) + \lambda(\xi_1; A).$$

It is desirable for many purposes to have $\gamma$ measurable in the pair
$\xi_1, \xi_2$, that is, measurable with respect to the Borel field
$\mathcal{F}_X \times \mathcal{F}_X = \mathcal{F}$. If $\gamma$ has this
property, $\lambda(\xi_1; A)$ will be a $\xi_1$ function measurable with
respect to $\mathcal{F}_X$. According to our derivation, the $\xi_2$ function
$\gamma(\xi_1, \cdot)$ is measurable with respect to $\mathcal{F}_X$ and is
uniquely determined only up to values on sets of $p$ measure 0. The problem
is to choose this function for each $\xi_1$ to obtain $\xi_1, \xi_2$
measurability. This can be done as follows, under the additional hypothesis
that there is a sequence $\{B_j\}$ of $\mathcal{F}_X$ sets such that, if
$\mathcal{G}$ is the Borel field generated by the class of $B_j$'s, then
$\gamma(\xi_1, \cdot)$ is for each $\xi_1$ a $\xi_2$ function equal almost
everywhere ($p$ measure) to a function measurable with respect to
$\mathcal{G}$, and for each $\xi_1$ some $\mathcal{G}$ set is a singular set
for the set function $\lambda(\xi_1; \cdot)$. This hypothesis is satisfied in
all cases of interest.
For example, if $X$ is the real line, if $\mathcal{F}_X$ is the class of
linear Borel sets, and if $p$ is any measure of Borel sets, completed or not,
the $B_j$'s of this condition can be taken as the open intervals with
rational endpoints. Define

$$B_1^{(n)}, B_2^{(n)}, \cdots$$

as the intersections of the form $\bigcap_{j=1}^{n} A_j$ where $A_j$ is
either $B_j$ or $X - B_j$. The $B_j^{(n)}$'s are disjunct and have union $X$,
and the Borel field generated by the class of $B_j^{(n)}$'s, $j, n \ge 1$, is
$\mathcal{G}$. Define

$$\gamma_n(\xi_1, \xi_2) = \frac{p(\xi_1; B_j^{(n)})}{p(B_j^{(n)})}, \qquad \xi_2 \in B_j^{(n)},\ p(B_j^{(n)}) \ne 0,$$
$$\gamma_n(\xi_1, \xi_2) = 0, \qquad \xi_2 \in B_j^{(n)},\ p(B_j^{(n)}) = 0.$$

Then $\gamma_n$ is $\xi_1, \xi_2$ measurable, since it is a generalized Baire
function of $\xi_1$ on each $\xi_1, \xi_2$ measurable set

$$\{\xi_1 \in X,\ \xi_2 \in B_j^{(n)}\}.$$

We have seen above that, for each $\xi_1$,

$$\lim_{n \to \infty} \gamma_n(\xi_1, \xi_2) = \gamma_\infty(\xi_1, \xi_2)$$

exists and is finite for almost all $\xi_2$ ($p$ measure). Here
$\gamma_\infty(\xi_1, \cdot)$ is the derivative of the measure
$p(\xi_1; \cdot)$ with respect to $p$ measure relative to the net of
$B_j^{(n)}$'s. Moreover, by hypothesis, $\gamma(\xi_1, \cdot)$ is equal
almost everywhere to a function measurable with respect to $\mathcal{G}$, and
some singular set of $\lambda(\xi_1; \cdot)$ is a $\mathcal{G}$ set. Hence,
by Theorem 2.5, $\gamma_\infty(\xi_1, \cdot)$ is one version of the density
of the absolutely continuous component of $p(\xi_1; \cdot)$ relative to $p$
measure. Moreover, $\gamma_\infty$ is $\xi_1, \xi_2$ measurable, as the limit
of a sequence of $\xi_1, \xi_2$ measurable functions. The $\xi_2$ function
$\gamma_\infty(\xi_1, \cdot)$ may be undefined on an $\mathcal{F}_X$ set of
$p$ measure 0. We define the function as 0 on this set, and the
$\xi_1, \xi_2$ function obtained in this way is the desired function
$\gamma$.
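
The construction of the net in this example, taking all intersections of
$B_j$ or $X - B_j$ for $j \le n$, can be illustrated on a finite space. The
sketch below is only an illustration: the function name `atoms`, the
eight-point space, and the sets `Bs` are assumptions made here, and empty
intersections are dropped for convenience.

```python
from itertools import product

def atoms(X, Bs):
    """Non-empty intersections A_1 & ... & A_n, where each A_j is either
    B_j or its complement X - B_j."""
    X = frozenset(X)
    cells = []
    for choice in product((True, False), repeat=len(Bs)):
        cell = X
        for keep, B in zip(choice, Bs):
            cell = cell & B if keep else cell - B
        if cell:
            cells.append(cell)
    return cells

Bs = [{0, 1, 2, 3}, {2, 3, 4, 5}, {1, 3, 5, 7}]   # hypothetical B_1, B_2, B_3
level2 = atoms(range(8), Bs[:2])    # the B_j^(2)'s
level3 = atoms(range(8), Bs)        # the B_j^(3)'s
```

The cells at each level are disjunct with union $X$, and every cell of level
3 lies inside a cell of level 2, which is the nesting property required of a
net.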


3. Measure-preserving transformations 

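DEFINITION Let $\Omega$ [$\tilde\Omega$] be an abstract space of points
$\omega$ [$\tilde\omega$]. Let $q$ [$\tilde q$] be a measure defined on the
sets of a Borel field $\mathcal{F}$ [$\tilde{\mathcal{F}}$] of $\omega$
[$\tilde\omega$] sets. Let $T$ be a single-valued transformation defined on
the points of $\Omega$, taking them into points of $\tilde\Omega$. If
$\tilde\Lambda \in \tilde{\mathcal{F}}$, let $T^{-1}\tilde\Lambda$ be the
$\omega$ set $\{T\omega \in \tilde\Lambda\}$. Then $T$ is called a
single-valued measure-preserving point transformation if the following two
assertions are true:

(i) if $\tilde\Lambda \in \tilde{\mathcal{F}}$, then
$T^{-1}\tilde\Lambda \in \mathcal{F}$, and

$$\tilde q\{\tilde\Lambda\} = q\{T^{-1}\tilde\Lambda\};$$

(ii) if $\Lambda \in \mathcal{F}$, there is a
$\tilde\Lambda \in \tilde{\mathcal{F}}$ such that
$\Lambda = T^{-1}\tilde\Lambda$.
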
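According to this definition, $T^{-1}\tilde\Omega = \Omega$, so that

$$\tilde q\{\tilde\Omega\} = q\{\Omega\}.$$

The same reasoning shows that, if $\tilde R$ is the range of $T$, the
$\tilde\omega$ set of images of $\omega$ points under $T$, and, if
$\tilde R \subset \tilde\Lambda \in \tilde{\mathcal{F}}$, it follows that

$$\tilde q\{\tilde\Lambda\} = q\{\Omega\} = \tilde q\{\tilde\Omega\}.$$

Thus $\tilde R$ has outer measure $\tilde q\{\tilde\Omega\}$ in the sense
that every measurable $\tilde\omega$ set containing $\tilde R$ has measure
$\tilde q\{\tilde\Omega\}$.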

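Each point $\tilde\omega$ corresponds to the $\omega$ set of points going
into $\tilde\omega$ under $T$. These $\omega$ sets will be called elementary
sets below. Since every $\Lambda \in \mathcal{F}$ is the inverse image of
some $\tilde\Lambda$, $\Lambda = T^{-1}\tilde\Lambda$, it follows that every
$\mathcal{F}$ set is the union of elementary $\omega$ sets, and that, if $x$
is a measurable $\omega$ function, $x(\omega)$ is constant on each elementary
$\omega$ set. Then every $\tilde\omega$ function $\tilde x$ determines a
unique $\omega$ function $x$, denoted by $T^{-1}\tilde x$, with

$$\tilde x(\tilde\omega) = x(T^{-1}\tilde\omega).$$

Here $T^{-1}\tilde\omega$ is to be interpreted as any one of the inverse
images of $\tilde\omega$. If $\tilde x$ is $\tilde\omega$ measurable,
$x = T^{-1}\tilde x$ is $\omega$ measurable, because, if $A$ is a Borel set,
and if $\tilde\Lambda$ is defined by

$$\tilde\Lambda = \{\tilde x(\tilde\omega) \in A\},$$

then

$$\{x(\omega) \in A\} = \{\tilde x(T\omega) \in A\} = \{T\omega \in \tilde\Lambda\} = T^{-1}\tilde\Lambda,$$

so that the $\omega$ set on the left is measurable. Conversely, if $x$ is a
measurable $\omega$ function, there is a measurable $\tilde\omega$ function
$\tilde x$, denoted by $Tx$, such that $x = T^{-1}\tilde x$. In other words,
the image function of $x$, which is defined by the above relation between $x$
and $\tilde x$ only on $\tilde R$, can be defined on
$\tilde\Omega - \tilde R$ to yield a measurable $\tilde\omega$ function. The
extension on $\tilde\Omega - \tilde R$ is not unique, but two measurable
image functions of $x$ will agree on $\tilde R$ and therefore will agree on
an $\tilde\omega$ set of measure $\tilde q\{\tilde\Omega\}$, since they agree
on a measurable set containing $\tilde R$. We derive an image of $x$ as
follows. (We can assume that $x$ is real.) Let
$\Lambda_r = \{x(\omega) < r\}$ for rational $r$, and let $\tilde M_r$ be an
$\tilde\omega$ measurable set with

$$\Lambda_r = T^{-1}\tilde M_r.$$

Such a set $\tilde M_r$ exists by hypothesis. Finally define

$$\tilde\Lambda_r = \bigcup_{s \le r} \tilde M_s \qquad (s \text{ rational}).$$

Then $\tilde\Lambda_r$ is $\tilde\omega$ measurable, and

$$\tilde\Lambda_r \subset \tilde\Lambda_s, \qquad r < s,$$
$$\Lambda_r = T^{-1}\tilde\Lambda_r.$$
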
We have

$$x(\omega) = \mathop{\mathrm{G.L.B.}}_{\omega \in \Lambda_r} r,$$

and we define

$$\tilde x(\tilde\omega) = \mathop{\mathrm{G.L.B.}}_{\tilde\omega \in \tilde\Lambda_r} r, \qquad \tilde\omega \in \bigcup_r \tilde\Lambda_r,$$
$$\tilde x(\tilde\omega) = 0, \qquad \tilde\omega \in \tilde\Omega - \bigcup_r \tilde\Lambda_r.$$

The transformation $T$ operating on measurable $\omega$ functions is not
necessarily single-valued, although it is single-valued if $\tilde\omega$
functions equal almost everywhere are considered identical. On the other
hand, the inverse transformation $T^{-1}$ (as applied to functions) is
single-valued. In particular, applying the function transformation to
functions taking on only the values 0 and 1, the transformation $T$ induces a
transformation operating on measurable $\omega$ sets, and taking them into
measurable $\tilde\omega$ sets. This set transformation is single-valued if
two $\tilde\omega$ sets differing only by a set of measure 0 are considered
identical. The inverse set transformation is single-valued.

The following properties of the transformation T, considered as a trans- 
formation on functions, are now easily checked. 

If $\tilde x_1, \ldots, \tilde x_n$ are measurable $\tilde\omega$ functions,
if $x_j = T^{-1}\tilde x_j$, and if $\Phi$ is a Baire function of $n$
variables, then

$$\Phi(x_1, \ldots, x_n) = T^{-1}\Phi(\tilde x_1, \ldots, \tilde x_n).$$

If $\tilde x_1, \tilde x_2, \cdots$ are measurable $\tilde\omega$ functions,
and if $x_j = T^{-1}\tilde x_j$, then

$$\lim_{n \to \infty} x_n(\omega)$$

exists for almost all $\omega$ if and only if

$$\lim_{n \to \infty} \tilde x_n(\tilde\omega)$$

exists for almost all $\tilde\omega$, and if these limits do exist they are
transforms of each other under $T$, neglecting values on sets of measure 0.
If $\tilde x$ is a measurable $\tilde\omega$ function, and if
$x = T^{-1}\tilde x$, then $x$ is integrable if and only if $\tilde x$ is,
and, if these functions are integrable,

$$\int_\Omega x\, dq = \int_{\tilde\Omega} \tilde x\, d\tilde q.$$

The last point may deserve a few remarks. If $\tilde x(\tilde\omega)$ takes
on only the values 0 and 1, the same is true of $x(\omega)$, and the equality
of integrals becomes the fact that $T$ makes correspond measurable sets of
the same measure. Then, since the class of functions $\tilde x$ for which the
statement is
true includes linear combinations of functions in the class, it follows that
the statement is true for functions $\tilde x$ which take on only finitely
many values. The general case can obviously be reduced to the case of a
non-negative $\tilde x$. Suppose then that $\tilde x$ is non-negative, and
define


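$$\tilde x_n(\tilde\omega) = \frac{j}{2^n}, \qquad \frac{j}{2^n} \le \tilde x(\tilde\omega) < \frac{j+1}{2^n},\ j = 0, 1, \ldots, n 2^n - 1,$$
$$\tilde x_n(\tilde\omega) = 0, \qquad \tilde x(\tilde\omega) \ge n.$$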


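Then $\lim_{n \to \infty} \tilde x_n(\tilde\omega) = \tilde x(\tilde\omega)$,
and, if $x_n = T^{-1}\tilde x_n$, and $x = T^{-1}\tilde x$, it follows that

$$\lim_{n \to \infty} x_n(\omega) = x(\omega)$$

for all $\omega$. Moreover,

$$\tilde x_1(\tilde\omega) \le \tilde x_2(\tilde\omega) \le \cdots,$$

so that

$$x_1(\omega) \le x_2(\omega) \le \cdots.$$

Since $x_n$ takes on only finitely many values,

$$\int_\Omega x_n\, dq = \int_{\tilde\Omega} \tilde x_n\, d\tilde q.$$

Then when $n \to \infty$ we obtain the desired equality between the integrals
of $x$ and $\tilde x$ (in which infinite integrals are allowed).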
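
The equality of the two integrals can be seen directly on a finite space.
In the sketch below every choice is an assumption made here for
illustration: $\Omega = \{0, \ldots, 5\}$ with equal masses, $T\omega$ the
remainder of $\omega$ modulo 3, $\tilde q$ defined by
$\tilde q\{\tilde\Lambda\} = q\{T^{-1}\tilde\Lambda\}$, and $\mathcal{F}$
understood as the class of inverse images, so that condition (ii) of the
definition holds.

```python
from collections import defaultdict

q = {w: 1.0 / 6 for w in range(6)}    # hypothetical measure on Omega

def T(w):
    """Hypothetical single-valued point transformation into {0, 1, 2}."""
    return w % 3

# Define q~ so that q~{A~} = q{T^{-1} A~} for every set A~.
q_tilde = defaultdict(float)
for w, mass in q.items():
    q_tilde[T(w)] += mass

x_tilde = {0: 2.0, 1: -1.0, 2: 5.0}           # a function of omega~
x = {w: x_tilde[T(w)] for w in q}             # x = T^{-1} x~

integral_x = sum(x[w] * q[w] for w in q)
integral_x_tilde = sum(x_tilde[v] * q_tilde[v] for v in q_tilde)
```

The function `x` is constant on each elementary set $\{\omega : T\omega = v\}$,
and the two sums agree because each elementary set carries the same mass as
its image point.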

Finally we observe that the essential starting point of the discussion is 
the existence of a single-valued transformation $T^{-1}$, operating on
$\tilde{\mathcal{F}}$ sets, and taking them into $\mathcal{F}$ sets in such a
way that complements, unions, and intersections of $\tilde{\mathcal{F}}$ sets
go into the corresponding complements, unions, and intersections of the image
$\mathcal{F}$ sets. In fact, the original description of $T$ as a
single-valued point transformation served only to define $T^{-1}$ as such a
set transformation, and the whole discussion could have been carried through
on the basis of the postulation of the desired properties of $T^{-1}$. We
have not done this because the greater generality achieved in this way is not
required for the applications to be made in I (see the following examples and
I §6; a more general approach is developed in X §1).

Example 3.1 Let $\Omega$ be an abstract space of points $\omega$. Let $q$
be a measure defined on the sets of a Borel field of $\omega$ sets, and let
$x$ be a real measurable $\omega$ function. Let
$\mathcal{F} = \mathcal{B}(x)$ be the Borel field of $\omega$ sets of the
form $\{x(\omega) \in A\}$, where $A$ is a linear Borel set. Let
$\tilde\Omega$ be the real line, let $\tilde{\mathcal{F}}$ be the class of
linear Borel sets, and let $\tilde q$ be the measure of Borel sets defined by

$$\tilde q\{A\} = q\{x(\omega) \in A\}.$$

For each $\omega$ define $\tilde\omega = T(\omega) = x(\omega)$. Then $T$ is
a transformation taking a point $\omega$ into a real number $\tilde\omega$.
Under this transformation, $\Omega$ goes into a linear set $\tilde R$. Note
that $T$ is single-valued, but that its inverse is multiple-valued, in
general. If $A$ is a linear Borel set, we shall denote by $T^{-1}A$ the set
of all $\omega$ with $T\omega \in A$. We remark that, if $A$ is a Borel set,
and if $\Lambda = T^{-1}A$, then

$$q\{\Lambda\} = \int_A dF(\tilde\omega), \qquad F(\lambda) = q\{x(\omega) \le \lambda\},$$

where the integral is a Lebesgue-Stieltjes integral. In fact, we have seen
in Example 2.2 that every measure of Borel sets can be expressed in this
form, with suitable $F$, where $F$ can be chosen to make the evaluation
correct for $A$ a right semi-closed interval. In the present case the
evaluation is correct for such an $A$ by definition of $F$. If we now
restrict "measurable $\omega$ set" to mean "$\mathcal{F}$ set," the preceding
results mean that $T$ is a single-valued measure-preserving point
transformation. Note that, if $\Phi$ is a Baire function of a single
variable, then

$$\Phi(x) = T^{-1}\Phi(\tilde\omega),$$

so that, by our general results,

$$\int_\Omega \Phi(x)\, dq = \int_{-\infty}^{\infty} \Phi(\tilde\omega)\, dF(\tilde\omega).$$

In particular, we obtain the result

$$\int_\Omega x\, dq = \int_{-\infty}^{\infty} \tilde\omega\, dF(\tilde\omega),$$

a familiar result, which also follows trivially from the fact that the sums
usually used in approximating the integral on the left are the
Riemann-Stieltjes sums used in approximating the integral on the right.
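
For a function $x$ taking only finitely many values, $F$ is a pure jump
function and the Stieltjes integral is a finite sum, so the change of
variable can be verified by direct computation. The three-point space, the
masses, and the helper name `stieltjes_integral` below are assumptions made
for this illustration only.

```python
q = {"a": 0.2, "b": 0.3, "c": 0.5}    # hypothetical measure on Omega
x = {"a": 1.0, "b": 1.0, "c": 4.0}    # a real function of omega

def F(lam):
    """F(lam) = q{x(omega) <= lam}."""
    return sum(mass for w, mass in q.items() if x[w] <= lam)

def stieltjes_integral(phi):
    """Integral of phi dF for the pure jump function F above:
    the sum of phi(v) times the jump of F at each value v of x."""
    total, previous = 0.0, 0.0
    for v in sorted(set(x.values())):
        total += phi(v) * (F(v) - previous)
        previous = F(v)
    return total

def square(s):
    return s * s

lhs = sum(square(x[w]) * q[w] for w in q)    # integral of Phi(x) dq
rhs = stieltjes_integral(square)             # integral of Phi dF
```

With $\Phi(s) = s^2$ both sides equal $1 \cdot 0.5 + 16 \cdot 0.5$; the
points $a$ and $b$ form one elementary set because $x$ takes the same value
on them.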

Example 3.2 This example is the generalization of the previous one to 
an arbitrary dimensionality. Let $\Omega$ be an abstract space of points
$\omega$. Let $q$ be a measure defined on the sets of a Borel field of
$\omega$ sets, and let $\{x_t, t \in T\}$ be a family of real measurable
$\omega$ functions. Let $\mathcal{F} = \mathcal{B}(x_t, t \in T)$. Let
$\tilde\Omega$: $\{\tilde\omega\}$ be the class of real functions
$\xi(\cdot)$ of $t \in T$. Let $\tilde x_s$ be the $\tilde\omega$ function
with value $\xi(s)$ if $\tilde\omega$ is the function $\xi(\cdot)$, so that
$\tilde x_s(\tilde\omega) = \xi(s)$. Let $t_1, \ldots, t_n$ be a finite
subset of $T$, and let $A$ be an $n$-dimensional right semi-closed interval,
or a finite union of such intervals. Let $\tilde{\mathcal{F}}_0$ be the field
of $\tilde\omega$ sets of the form

$$\{[\tilde x_{t_1}(\tilde\omega), \ldots, \tilde x_{t_n}(\tilde\omega)] \in A\}.$$

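Let $\tilde{\mathcal{F}} = \mathcal{B}(\tilde x_t, t \in T)$ be the Borel
field of $\tilde\omega$ sets generated by $\tilde{\mathcal{F}}_0$. For each
$\omega$ define $\tilde\omega = T(\omega)$ as the function of $t$ obtained
from $x_t(\omega)$ when $t$ varies. Then $T$ is a single-valued point
transformation taking $\Omega$ into $\tilde R \subset \tilde\Omega$. The
inverse transformation is not single-valued, in general, and we define
$T^{-1}\tilde\Lambda$, for $\tilde\Lambda$ an $\tilde\omega$ set, as the
$\omega$ set on which $T\omega \in \tilde\Lambda$. Then, if
$\tilde\Lambda \in \tilde{\mathcal{F}}$, it follows that
$T^{-1}\tilde\Lambda \in \mathcal{F}$. In fact, this is true by our
definitions if $\tilde\Lambda \in \tilde{\mathcal{F}}_0$, and, since the
$\tilde{\mathcal{F}}$ sets for which this assertion is true form a Borel
field, the assertion is true for all $\tilde{\mathcal{F}}$ sets. If
$\tilde\Lambda \in \tilde{\mathcal{F}}$, and if
$\Lambda = T^{-1}\tilde\Lambda$, define

$$\tilde q\{\tilde\Lambda\} = q\{\Lambda\}.$$

We remark that if $T$ is infinite this $\tilde q$ measure is the measure
determined from the measure of $\tilde{\mathcal{F}}_0$ sets by extension to
$\tilde{\mathcal{F}}$, as discussed in Example 2.3, because the measures
agree on $\tilde{\mathcal{F}}_0$ sets, and $\tilde{\mathcal{F}}$ is the Borel
field generated by $\tilde{\mathcal{F}}_0$. The transformation $T$ is a
single-valued measure-preserving point transformation, if $\omega$
measurability of a set $\Lambda$ is now taken to mean that
$\Lambda \in \mathcal{F}$. According to our general results on this type of
transformation,

$$q\{x_t(\omega) < c\} = \tilde q\{\tilde x_t(\tilde\omega) < c\}$$

and more generally, if $t_1, \ldots, t_n$ is a finite $T$ set and if $\Phi$
is a Baire function of $n$ variables, then

$$\int_\Omega \Phi(x_{t_1}, \ldots, x_{t_n})\, dq = \int_{\tilde\Omega} \Phi(\tilde x_{t_1}, \ldots, \tilde x_{t_n})\, d\tilde q.$$

That is to say, if either integral is defined, both are, and the two are
equal. Thus for many purposes the family $\{x_t, t \in T\}$ of $\omega$
functions can be replaced by the family $\{\tilde x_t, t \in T\}$ of
$\tilde\omega$ functions. The $\tilde x_t$'s are the coordinate variables in
a space of dimensionality the cardinal number of the aggregate $T$.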


APPENDIX 


CHAPTER I 


§1-§5 

The basic paper on probability as measure theory is Kolmogorov’s [5, 1933]. 

The hypothesis that P measure is complete is used only when separability and 
measurability of a stochastic process (see II, §2) are of interest. Thus this hypothesis 
will never actually be needed when only finite or enumerably infinite collections of 
random variables (discrete parameter stochastic processes) are under consideration, 
and will not be needed for a considerable part of the study of non-denumerably infinite 
families of random variables (continuous parameter stochastic processes). 


§6 
Representation theory was stressed by Doob [4, 1938] as a device to reduce various 
probability theorems to standard measure theorems. 


§7, §8 
The measure theoretic definitions and basic properties of conditional probabilities 
and expectations were given by Kolmogorov [5, 1933]. 


§9 


Theorem 9.4 was not originally in the text. Mrs. Shuh-teh Chen Moy pointed out 
to the author that it was contained in his proof of Theorem 9.5, and that it was worth 
separating out. 

To avoid overburdening the text, Theorem 9.5 was not stated with maximum gener- 
ality. In the first place, the only property of the range R actually used was its Lebesgue- 
Stieltjes measurability for every Lebesgue-Stieltjes measure. Since analytic sets have 
this property, the theorem remains true if R is only supposed analytic, rather than 
Borel. In the second place, the theorem is obviously true if its hypotheses are satisfied 
when $\Omega$ is replaced in the hypotheses by a subset $\Omega_0$ of
probability 1. We shall call the theorem, with $R$ analytic and $\Omega$
replaced by $\Omega_0$, the generalized version of Theorem 9.5. (It can be
shown that, with the replacement of $\Omega$ by $\Omega_0$, the theorem is no
more general when $R$ is only supposed analytic than when $R$ is supposed
Borel.) The hypotheses of the generalized version of Theorem 9.5 are
satisfied (for every choice of $y_1, \ldots, y_n$) if $\Omega$ is a complete
metric space, and if the given probability measure is a measure (completed
or not) of Borel sets. The hypotheses are also satisfied (for every choice of
$y_1, \ldots, y_n$) if $\Omega$ and the given probability measure have the
property that, if $x$ is any random variable, and if $A$ is a linear set such
that $\{x(\omega) \in A\}$ is a measurable $\omega$ set, then

$$P\{x(\omega) \in A\} = \mathop{\mathrm{G.L.B.}}_{B \supset A} P\{x(\omega) \in B\},$$

where $B$ is open. Gnedenko and Kolmogorov [1, 1949] impose the latter
condition as part of their definition of a probability measure.

Note that, if the class $\mathscr{F}_1$ of measurable $\omega$ sets is the Borel field generated by a
denumerable subclass, there is a random variable $x$ such that the class of sets
$\{x(\omega) \in A\}$, for $A$ a linear Borel set, is $\mathscr{F}_1$. Then, if the existence of a conditional
probability distribution of $x$ is assured by the generalized version of Theorem 9.5, there
is a conditional distribution of $\mathscr{F}_1$ sets relative to any Borel field $\mathscr{F}$ of measurable
sets.

Finally, consider the following example. The space $\Omega$ is the interval [0, 1]. Let
$\mathscr{F}$ be the class of Borel subsets of $\Omega$. Let $A$ be a subset of $\Omega$ of outer Lebesgue measure
1 and inner Lebesgue measure 0, fixed throughout the following. Let $\mathscr{F}_1$ be the Borel
field generated by $A$ and the $\mathscr{F}$ sets, that is, $\mathscr{F}_1$ is the class of sets of the form
$AB_1 \cup A'B_2$, where $A'$ is the complement of $A$, and $B_1$, $B_2$ are $\mathscr{F}$ sets. A probability
measure of $\mathscr{F}_1$ sets is defined by
\[ P\{AB_1 \cup A'B_2\} = \tfrac{1}{2}[m(B_1) + m(B_2)], \]
where $m(B)$ is the Lebesgue measure of $B$. It is easily verified that this definition is
unique, and actually defines a probability measure of $\mathscr{F}_1$ sets. This probability
measure reduces to Lebesgue measure on the Borel sets, and has value $\tfrac{1}{2}$ for the set $A$.
It is easy to verify that there is no conditional probability distribution of $\mathscr{F}_1$ sets
relative to $\mathscr{F}$. Let $x$ be a function defined on this space, with the property that the
class of sets $\{x(\omega) \in B\}$ for $B$ Borel is $\mathscr{F}_1$. It is trivial to define such a function, and
such a function cannot have a conditional probability distribution relative to $\mathscr{F}$. This
example contradicts a theorem of Doob [4, 1938, Theorem 3.1] according to which a
conditional probability measure of $\mathscr{F}_1$ sets relative to a field $\mathscr{F}$ always exists if $\mathscr{F}_1$ is
the Borel field generated by a denumerable subclass of its sets. The incorrectness of
this theorem and a related theorem [ibid., Theorem 1.1] was pointed out by Dieudonné
and by Andersen and Jessen. A counterexample somewhat more special than that
just given, and as such not quite contradicting the existence of a conditional probability
distribution as defined in this book, is in Halmos [3, 1950, §48].
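
The consistency of the definition of P in the example above can be sketched as follows. The sets $C_1$, $C_2$ are auxiliary notation introduced only for this check, and the subscripts on $B$ are read as 1 and 2; this is an illustrative verification, not the argument of the text.

```latex
% Suppose the same set has two representations, with B_1, B_2, C_1, C_2 Borel:
\[ AB_1 \cup A'B_2 = AC_1 \cup A'C_2 . \]
% Intersecting with A and with A' gives AB_1 = AC_1 and A'B_2 = A'C_2, so
\[ B_1 \,\triangle\, C_1 \subset A', \qquad B_2 \,\triangle\, C_2 \subset A . \]
% A has outer measure 1, hence A' has inner measure 0; A has inner measure 0
% by hypothesis. A Borel subset of a set of inner measure 0 has measure 0, so
\[ m(B_1) = m(C_1), \qquad m(B_2) = m(C_2), \]
% and the value (1/2)[m(B_1) + m(B_2)] does not depend on the representation.
% Two special cases, for a Borel set B and for A itself:
\[ P\{B\} = P\{AB \cup A'B\} = \tfrac12[m(B) + m(B)] = m(B), \qquad
   P\{A\} = P\{A\Omega \cup A'\emptyset\} = \tfrac12[1 + 0] = \tfrac12 . \]
```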


§10 
The results in this section are due to Kolmogorov [5, 1933]. 


§11
The inequalities (11.8), (11.8'), (11.8''), (11.9), (11.5), (11.10) are new as stated, but
are implicit in the literature, at least in special cases. They were suggested to the 
writer on seeing (11.8) (with a = 1/x and A the interval [0, a]), and a variation of 
(11.9), in Wintner [3, p. 18]. 


CHAPTER II


§1 

For some purposes it is convenient to let the parameter of a stochastic process be
a set in a certain additive family of sets. See Bochner [2, 1942] for a general discussion
from this point of view. For example, the Brownian motion process, §9 Example 1,
can be generalized by supposing that to each finite union $I$ of k-dimensional intervals
there corresponds a Gaussian random variable $x_I$ with expectation 0 and variance
the k-dimensional volume of $I$, and that $x_{I_1}, \ldots, x_{I_n}$ have an n variate Gaussian
distribution with $E\{x_{I_i} x_{I_j}\}$ the volume of the intersection $I_i I_j$. When k = 1, if $I_t$ is
the interval (0, t], and if $y_t = x_{I_t}$, the stochastic process $\{y_t, 0 < t < \infty\}$ is a Brownian
motion process as defined in §9 Example 1. We observe that even when $t$ is identified
with a member of a family of sets, as in this example, the existence of such a process
is guaranteed by the validity of the Kolmogorov consistency conditions discussed in
I §5, because the parameter was abstract-valued in that discussion.
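
For the case k = 1 of the example above, the covariance prescription can be worked out explicitly. The following is a sketch under the reading that the parameter set attached to $t$ is the interval $(0, t]$; the names $s$, $t$, $u$ are introduced here for illustration.

```latex
% Covariance: the length of (0, s] \cap (0, t].
\[ E\{y_s y_t\} = \min(s, t), \qquad s, t > 0 . \]
% Variance of an increment, for s < t:
\[ E\{(y_t - y_s)^2\} = t - 2\min(s, t) + s = t - s . \]
% Orthogonality of increments over adjacent intervals, for s < t < u:
\[ E\{(y_u - y_t)(y_t - y_s)\} = (t - s) - (t - s) = 0 . \]
% The variables are jointly Gaussian, so orthogonal increments are independent,
% which is the defining structure of the Brownian motion process.
```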


§2 

For further discussion of the separability and measurability of stochastic processes, 
and allied topics, see Doob [3, 1937; 5, 1940; 9, 1947], Ambrose [1, 1940], Doob
and Ambrose [1, 1940]. The point of view in §2, and used throughout this book, is
somewhat more general than that in the referenced papers in that the choice of the
basic $\omega$ space as function space or a slight modification thereof, so that the random
variables of the stochastic process under discussion become coordinate variables, is
now rejected except as an elegant special case. The phraseology of the basic theorems
on separability and measurability therefore differs somewhat in the above references
from that in §2. The relation between them is discussed in Doob and Ambrose
[1, 1940]. For a point of view similar to that in this book see Slutsky [2, 1937].

If P measure had not been supposed complete, in the definition of separability on
p. 51 the two $\omega$ sets on the last line would have been supposed to differ by a measurable
subset of $\Lambda$.


For an early treatment of
\[ \int_a^b x(t, \omega)\, dt \]
not as an integral of the sample function $x(t, \omega)$ but as a random variable defined as a
mean limit of the Riemann sums, see Slutsky [1, 1928].


§3 
The relation between strict and wide sense ideas is well known to probabilists in 
various special cases. It has been defined more carefully than usual, and followed 
through more systematically than usual in the rest of this book, to help in the understanding
and organization of a large body of results.


§6 
Markov processes were called stochastically definite processes in Kolmogorov
[3, 1931]. The Markov property is sometimes carelessly defined to be the property
that, if $s < t$, then the conditional probability $P\{x_t(\omega) \le \lambda \mid x_s\}$ does not depend on $x_r$
for $r < s$. This definition is of course incorrect, because (whether the process has
the Markov property or not) the indicated conditional probability is a random variable
which is defined with no reference whatever to any $x_r$ for $r < s$.
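
For contrast with the careless wording criticized above, one standard way of stating the Markov property compares two conditional probabilities instead of speaking of "dependence". The formulation below is a sketch in notation chosen here ($t_1 < \cdots < t_n$, $\lambda$ real) and is not quoted from §6.

```latex
% For every n, every choice t_1 < t_2 < \cdots < t_n of parameter values,
% and every real \lambda:
\[ P\{x_{t_n}(\omega) \le \lambda \mid x_{t_1}, \ldots, x_{t_{n-1}}\}
   = P\{x_{t_n}(\omega) \le \lambda \mid x_{t_{n-1}}\} \]
% with probability 1. Both sides are random variables; the property asserts
% that conditioning on the additional past variables does not change the
% conditional probability given the most recent one.
```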


§7 
The name martingale is due to Ville [1, 1939]. The martingale property is called the
property E in Doob [5, 1940].


§9


Processes with independent increments were called differential processes in Doob
[3, 1937], homogeneous processes in Cramér [1, 1937] (who only considered the case of
stationary increments), integrals with random elements in Lévy [2, 1934; 5, 1937], and
additive processes in Lévy [7, 1948]. The systematic study of these processes was 
initiated by de Finetti [1, 1929]. 


CHAPTER III 


§1 
See Kolmogorov [5, 1933], Jessen [1, 1934] for various versions of the zero-one law. 
Numerous special cases had been observed before the general theorem was discovered. 


§2 

The strong half of Theorem 2.1, the inequality (2.1’), is due to Kolmogorov [1, 1928], 
who supposed that the $y_j$'s were mutually independent. However, the fact that his
proof uses only the hypotheses stated in Theorem 2.1, involving conditional expecta- 
tions, has been noted by several authors. 

Theorem 2.3 is due to Khintchine and Kolmogorov [1, 1924]. Theorems 2.4 and
2.5 are due to Kolmogorov [1, 1928, and the preceding reference]. The work of these
authors was developed further by Lévy [1, 1931], Jessen [1, 1934], Jessen and Wintner
[1, 1935]. See also Marcinkiewicz [1, 1937; 2, 1938], Marcinkiewicz and Zygmund
[1, 1937], van Kampen and Wintner [1, 1937], van Kampen [1, 1940], Wintner’s books
[2, 1938; 3, 1947], Lévy’s book [4, 1937], Kunisawa [1, 1949]. Instead of dealing
directly with an infinite series of mutually independent random variables, one can treat
the infinite convolution of distribution functions of the random variables. This is the
approach used by van Kampen and Wintner. Lévy’s approach stresses methods
involving explicitly the decreasing concentration of the successive partial sums of a
series of mutually independent random variables. Kawata [1, 1941] simplifies this
approach somewhat by averaging Lévy’s function of concentration of a distribution,
and Kunisawa in the above reference gave a complete treatment based on such an
average. Kawata and Udagawa [1, 1949] proved the possibility in Theorem 2.7 of
criteria involving the characteristic function on a set of positive Lebesgue measure,
obtaining slightly weaker results than those proved in that theorem.

The proof of the corollary to Theorem 2.7 used the inequality
\[ \Bigl|\prod_j g_j - 1\Bigr| \le \sum_j |g_j - 1|, \]
valid for any complex $g_j$'s of modulus $\le 1$. It is sufficient to prove this inequality
for finitely many $g_j$'s. It is trivially true for only one $g_j$. To finish the proof we need
only show that, if the inequality is true for $g_1, \ldots, g_n$, it is true for $g_1, \ldots, g_{n+1}$.
This follows from
\[ \Bigl|\prod_1^{n+1} g_j - 1\Bigr|
   = \Bigl|\Bigl(\prod_1^{n} g_j - 1\Bigr) g_{n+1} + g_{n+1} - 1\Bigr|
   \le \Bigl|\prod_1^{n} g_j - 1\Bigr| + |g_{n+1} - 1|
   \le \sum_1^{n+1} |g_j - 1| . \]


§3 


Necessary and sufficient conditions for the weak law of large numbers were found 
by Kolmogorov [1, 1928], and in somewhat more general cases by Feller [4, 1937].
See also Marcinkiewicz [3, 1938], Doeblin [2, 1939], Gnedenko [1, 1939; 3, 1944], 
Kunisawa [1, 1949]. 

Theorem 3.4 is due to Kolmogorov [2, 1930]. 

See Loève [2, 1945] for these theorems for dependent random variables.


§4 

The expression (4.6) for the characteristic function of an infinitely divisible law was 
discovered by Lévy [2, 1934]. A derivation in the particular case when the distribution 
has a second moment had already been given by Kolmogorov [4, 1932]. See also 
Cramér [1, 1937, Chapter VIII] for Kolmogorov’s result. The first analytic derivations
were given by Khintchine [3, 1937] and Feller [3, 1937]. The derivation given here is 
somewhat more straightforward than these because of the availability of the character- 
istic function inequalities of I §11. 

For general discussions of limit laws of sums of mutually independent random 
variables see Doeblin [2, 1939], Gnedenko [1, 1939; 3, 1944], Khintchine [4, 1938],
Gnedenko and Kolmogorov [1, 1949]. 

Necessary and sufficient conditions for the central limit theorem (essentially equiva- 
lent to Theorem 4.2) were obtained by Lévy [3, 1935] and in analytic form by Feller 
[1, 1935]. See also Doeblin [2, 1939], Gnedenko [1, 1939], Marcinkiewicz [3, 1938].
Theorem 4.3 is due to Lindeberg [1, 1922]. Theorem 4.4 is a classical version of the
central limit theorem due to Liapounov. See Loève [2, 1945] for the extensions of
these theorems to dependent random variables, and the references to the literature on 
these extensions. 


§5 
Theorem 5.1 is due to Kolmogorov [5, 1933]. The proof given is Kolmogorov’s.
Theorem 5.2 is due to Doob [2, 1936].


CHAPTER IV 


Since there are many good treatments of orthogonal functions, from various points 
of view, the treatment in IV is condensed. It cannot properly be omitted. For
example, the Riesz-Fischer theorem that a sequence of functions converging in the
mean according to the Cauchy criterion has a limit in the mean, that is, that $L_2$ is a
complete space, is no less a probability theorem than the central limit theorem, and 
belongs in this book as much as the latter. This is not meant as an attempt to 
appropriate the theory of orthogonal functions for the theory of probability as dis- 
tinguished from measure theory or Hilbert space theory. However, the relation of 
orthogonality between two functions is precisely the kind of relationship exploited by 
probability theory, and this is no less true because the theory of orthogonal functions
was not developed in connection with what was considered the theory of probability 
at the time of the development. As a concession to tradition and the readability of 
texts on orthogonal functions the proof of the Riesz-Fischer theorem was omitted 
in 1 §4.

As a general reference to IV see books by Kaczmarz and Steinhaus [1, 1935] and by
Stone [2, 1932]. In particular for §6 see Zygmund [1, 1935]. 


CHAPTER V 


§1-§4 

For general treatments of Markov chains, see books by Hostinsky [1, 1931], Fréchet 
[2, 1938] (in both of which detailed references to the literature are given), Romanovski 
[1, 1949], Feller [6, 1950]. Treatments of Markov chains with no restriction on the
number of states have been given by Kolmogorov [6, 1936], Yosida and Kakutani
[1, 1939], Doob [7, 1942], Feller [6, 1950]. The fundamental result in the case of
finitely many states, that of §2 Case (b), goes back to Markov [1, 1906] and has been
rediscovered frequently. The fundamental work in the case of infinitely many states
is due to Kolmogorov [6, 1936].

§5 

The treatment in §5 is essentially that of Doeblin [1, 1937], but generalized to an 
abstract state space, and amplified by an analysis of the possible classes of D triples 
p,e, v. Treatments of Markov processes with (possibly) continuous state spaces have 
been given in various degrees of generality, with many different methods, by Kolmogorov
[3, 1931], Fréchet [1, 1934], Krylov and Bogolioubov [2, 3, 1937], Doob [4, 1938;
10, 1948], Doeblin [1, 1937; 5, 1940], Yosida and Kakutani [2, 1941], Beboutoff
[1, 1942], Yaglom [1, 1947], Yosida [2, 1948]. The most far reaching of these is
Doeblin’s 1940 paper. The earlier and less definitive papers have been ignored in 
this listing. 

§6 

See also Doob [4, 1938; 10, 1948], Yosida [1, 1940], Kakutani [1, 1940] for the law
of large numbers for Markov processes from a different point of view.


§7 
The central limit theorem in the Markov chain case is due to Markov [2, 1924].
Theorem 7.5, under the additional assumption that f is bounded, is due to Doeblin
[1, 1937].
For a discussion of the central limit theorem for a sequence whose elements are 
independent if sufficiently far apart, as in our application of Theorem 7.5’, see Hoeffding 
and Robbins [1, 1948]. 


CHAPTER VI 


§1 
For Markov chains with a continuous parameter see Kolmogorov [3, 1931], Krylov 
and Bogolioubov [1, 1936], Doeblin [4, 1940], Doob [7, 1942; 8, 1945]. See also
Hille [1, 1948] for an analytic approach from the point of view of semi-groups.
Theorem 1.1 is due to Doeblin [4, 1940]. Theorems 1.2, 1.3, and 1.4 are contained in
much more general results of Doeblin [3, 1939]. (See also §2 below.) 


§2 
Pospisil [1, 1935-36] and Feller [2, 1936] treated the Markov processes considered in 
§2, obtaining existence and uniqueness theorems in the case of a bounded function q 
in (2.2). The treatment was purely analytic, considering the problem that of solving 
the Chapman-Kolmogorov integral equation. Doeblin [3, 1939] considered the
problem probabilistically, treating the problem as that of the analysis of Markov
processes under assumptions strong enough to assure that almost all sample functions
are step functions. Feller [5, 1940] treated the problem as an analytic one, like
Pospisil, weakening the hypotheses of Pospisil and Doeblin, and his own early ones.
These authors all assumed, implicitly or explicitly, that the processes under consideration
have the property that almost all the sample functions are step functions. In the
special case of Markov chains, Doob [8, 1945] showed the possibility of dropping this
assumption. The treatment in §2 follows that of the latter reference, appropriately
generalized, and thus includes the results of Pospisil, Doeblin, and Feller, aside from
the fact that those authors did not assume the stationarity of the stochastic transition
functions. Lévy [8, 1951] gave a detailed analysis of the various possible types of
sample functions in the chain case, including those whose discontinuities are not at a
well-ordered set on the t-axis.


§3

The processes discussed in this section were first studied systematically by Kolmogorov
[3, 1931], who established the fact that the transition probabilities satisfy the partial
differential equations (3.4) and (3.4’). Feller [2, 1936] proved the appropriate existence
and uniqueness theorems for solutions of these equations. Ito [3, 1946; 4, 1951]
showed that these processes could be obtained constructively by solving stochastic
differential equations. This work of Ito is given in §3, with modifications made
possible by other work in this book. Fortet [1, 1943] used Feller’s results to analyze
the continuity and related properties of the sample functions of these processes.
Kolmogorov, Feller, and Ito also discussed in the indicated references more general
processes, combinations of those discussed in §2 and §3.

For a treatment of the solutions of the differential equations (3.4) and (3.4’) as
limiting transition probabilities of sums of dependent random variables, in effect a
study of one type of generalization of the central limit theorem, see, for example,
Bernstein [2, 1938], Khintchine [1, 1933].


CHAPTER VII 


§1

Martingales have been studied by many authors, referred to below. See particularly 
Lévy [5, 1937], Ville [1, 1939], Doob [5, 1940]. Semi-martingales are here introduced
for the first time. 

We recall that the random variables of a family $\{x_t, t \in T\}$ are said to be uniformly
integrable if
\[ \lim_{N \to \infty} \int_{\{|x_t(\omega)| > N\}} |x_t|\, dP = 0 \]
uniformly in $t$. A necessary and sufficient condition for uniform integrability is that
$E\{|x_t|\}$ be bounded in $t$, and that, if $P\{\Lambda\} = \delta$,
\[ \lim_{\delta \to 0} \int_{\Lambda} |x_t|\, dP = 0. \]
It is sufficient for uniform integrability if $E\{|x_t|^{\alpha}\}$ is bounded in $t$ for some $\alpha > 1$.
If $\{x_n, n \ge 1\}$ is a sequence of non-negative random variables converging with
probability 1 to $x$, with expectations converging to the finite limit $c$, then (Fatou)
$c \ge E\{x\}$. There is equality if and only if the $x_n$'s are uniformly integrable.
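
The sufficiency of a bounded moment of order greater than 1, recalled above, follows from a one-line estimate. In the sketch below the bound $K$ and the exponent name $\alpha$ are notation introduced for the illustration.

```latex
% Assume E\{|x_t|^{\alpha}\} \le K for all t, with \alpha > 1.
% On the set \{|x_t| > N\} we have |x_t|^{1-\alpha} \le N^{1-\alpha}, hence
\[ \int_{\{|x_t| > N\}} |x_t|\, dP
   = \int_{\{|x_t| > N\}} |x_t|^{\alpha}\, |x_t|^{1-\alpha}\, dP
   \le N^{1-\alpha} E\{|x_t|^{\alpha}\}
   \le K N^{1-\alpha} . \]
% The right side does not depend on t and tends to 0 as N \to \infty,
% which is the definition of uniform integrability.
```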


§2 
Theorems 2.1 and 2.2 are new. Theorem 2.3 is a sharpening of a theorem of Halmos
[1, 1939].


§3
Theorem 3.1, as applied to martingales, is due to Doob [5, 1940]. 
The theorem that any process which is a martingale in both parameter orders has
the property that, for every pair of parameter values s, t,
\[ P\{x_s(\omega) = x_t(\omega)\} = 1 \]
is new. The proof given in the text is due to J. R. Kinney and J. L. Snell.

Theorem 3.2 for martingales [in which case (3.4”) can be obtained from (3.4) by
applying (3.4’) to the process $\{-x_j, 1 \le j \le n\}$] was first used by Lévy and Ville.

Theorem 3.3 for martingales is due to Doob [13, 1951]. The fact that the theorem 
is also true for semi-martingales is due to J. L. Snell, and the proof in the text is his. 

Theorem 3.4 is new. 

§4 

The martingale convergence theorems of this section are taken from Doob [5, 1940],
with slightly strengthened subsidiary results. Various special cases had been discovered
by other authors, as detailed below. The semi-martingale theorems are new
(but see below the discussion of work of Andersen and Jessen, who obtained somewhat
weaker theorems in a different formulation).

Theorem 4.1 (iv) is due to Lévy [5, 1937, Theorem 68], under a regularity condition
on $x_{n+1} - x_n$ slightly different from (4.2).

Theorem 4.1, Corollary 2, is due to Lévy [5, 1937, Corollary 68]. 

Theorem 4.3, Corollary 1, the second limit equation of (4.13), is due essentially to 
Lévy [4, 1935; 5, 1937, Theorem 41], who proved the theorem in a slightly different 
form, in which z is a random variable which is the characteristic function of a point 
set, so that the conditional expectations become conditional probabilities. 

Jessen [1, 1934] proved what is essentially Theorem 4.3, Corollary 1, in the case in 
which the $y_j$'s are mutually independent with a common distribution, each distributed
uniformly in the interval [0, 1] (see §7 for a statement of his result). 

The following remarks are made to clarify the relation between the theorems of
Andersen and Jessen [1, 1946; 3, 1948] and the martingale convergence theorems of §4.
Let $\{x_n, \mathscr{F}_n, n \ge 1\}$ be a martingale, and let $\varphi_n$ be the completely additive function of
$\mathscr{F}_n$ sets defined by
\[ \text{(i)} \qquad \varphi_n(\Lambda) = \int_{\Lambda} x_n\, dP, \qquad \Lambda \in \mathscr{F}_n . \]
Then, if $m < n$,
\[ \varphi_m(\Lambda) = \varphi_n(\Lambda), \qquad \Lambda \in \mathscr{F}_m . \]
Let $\mathscr{F}_\infty$ be the Borel field of $\omega$ sets generated by $\bigcup_n \mathscr{F}_n$. If there is a random variable
$x_\infty$ such that $\{x_n, \mathscr{F}_n, 1 \le n \le \infty\}$ is a martingale, as will be true for example if the
$x_n$'s are uniformly integrable, then (i) defines a completely additive absolutely continuous
function $\varphi_\infty$ of $\mathscr{F}_\infty$ sets when $n = \infty$. In this case $\varphi_n$ is simply the function $\varphi_\infty$ with
domain of definition contracted to $\mathscr{F}_n$, and $x_n$ is the density of $\varphi_n$ relative to the given
probability measure (also with domain contracted to $\mathscr{F}_n$). Suppose, however, that the
hypothesis of the existence of $x_\infty$ with the stated properties is replaced by the hypothesis
that the $x_n$'s are non-negative. Then, by Theorem 4.1, $\lim_{n \to \infty} x_n$ exists and is finite with
probability 1, and we define $x_\infty$ as this limit. Note that $\{x_n, \mathscr{F}_n, 1 \le n \le \infty\}$ is now not
necessarily a martingale, but that, using Fatou’s lemma,
\[ \lim_{n \to \infty} \int_{\Lambda} x_n\, dP \ge \int_{\Lambda} x_\infty\, dP, \qquad \Lambda \in \mathscr{F}_m , \]
so that $\{-x_n, \mathscr{F}_n, 1 \le n \le \infty\}$ is a semi-martingale. If $\Lambda \in \bigcup_n \mathscr{F}_n$, $\varphi_n(\Lambda)$ is independent
of $n$ for large $n$, and we define
\[ \varphi_\infty(\Lambda) = \lim_{n \to \infty} \varphi_n(\Lambda) . \]
The set function $\varphi_\infty$, defined on the field $\bigcup_n \mathscr{F}_n$, is not necessarily completely additive,
as is seen by the following example, and complete additivity is a slightly weaker
condition than the condition of uniform integrability of the $x_n$ sequence, used above.
In §8 an example is given of a martingale $\{x_n, n \ge 1\}$ with $\Omega$ the interval [0, 1], having
the following properties:
\[ x_n \ge 0, \qquad E\{x_n\} = 1, \qquad \lim_{n \to \infty} x_n(\xi) = 0, \quad \xi \ne 1 . \]
The basic probability measure is Lebesgue measure. If we delete the point 1 from $\Omega$,
$\lim x_n = 0$ everywhere on $\Omega$. Define $\mathscr{F}_n$ as the class of finite unions of the intervals
\[ \Bigl( \frac{j}{2^n}, \frac{j+1}{2^n} \Bigr], \qquad j = 0, \ldots, 2^n - 1, \]
except that 0 is included in the interval with $j = 0$ and 1 is excluded from the interval
with $j = 2^n - 1$. Then, if $I_n$ is the interval with the latter value of $j$,
\[ x_n(\xi) = 2^n, \quad \xi \in I_n; \qquad x_n(\xi) = 0, \quad \xi \in \Omega - I_n . \]
On the other hand,
\[ 1 = \int_{I_n} x_n\, dP = \varphi_n(I_n) = \varphi_{n+1}(I_n) = \cdots = \varphi_\infty(I_n), \]
and it follows that $\varphi_\infty$ is not completely additive, or we would have
\[ \varphi_\infty(I_1) = \sum_{n=1}^{\infty} \varphi_\infty(I_n - I_{n+1}) = 0 . \]
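
The properties claimed for this example can be checked by direct computation. The sketch below reads the example as $x_n = 2^n$ on $I_n$ and $x_n = 0$ elsewhere, with $I_n$ the last interval of the $n$th dyadic partition, so that $I_n$ has length $2^{-n}$; the symbol $J$ and the letter $\varphi$ for the set functions are notation assumed here for illustration.

```latex
% Assumption: I_n = (1 - 2^{-n}, 1), so P\{I_n\} = 2^{-n} and I_{n+1} \subset I_n.
\[ E\{x_n\} = 2^n P\{I_n\} = 1 . \]
% Martingale equality, checked on the intervals generating \mathscr{F}_n:
\[ \int_{I_n} x_{n+1}\, dP = 2^{n+1} P\{I_{n+1}\} = 1 = \int_{I_n} x_n\, dP , \]
\[ \int_{J} x_{n+1}\, dP = 0 = \int_{J} x_n\, dP
   \quad \text{for every other interval } J \text{ of the } n\text{th partition.} \]
% The differences I_n - I_{n+1} carry no mass and exhaust I_1
% (the point 1 having been deleted):
\[ \varphi_\infty(I_n - I_{n+1}) = \int_{I_n - I_{n+1}} x_{n+1}\, dP = 0 ,
   \qquad \bigcup_{n \ge 1} (I_n - I_{n+1}) = I_1 . \]
% Since \varphi_\infty(I_1) = 1, complete additivity would give 1 = 0.
```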


Conversely, let $\Omega$ be an abstract space, let $\mathscr{F}_1 \subset \mathscr{F}_2 \subset \cdots$ be a monotone sequence
of Borel fields of $\omega$ sets, and let $\mathscr{F}_\infty$ be the Borel field of $\omega$ sets generated by $\bigcup_1^\infty \mathscr{F}_n$.
Let $P$ be a probability measure of $\mathscr{F}_\infty$ sets, let $\varphi$ be a completely additive absolutely
continuous (relative to $P$) function of $\mathscr{F}_\infty$ sets, and let $\varphi_n$ [$P_n$] be $\varphi$ [$P$] with its domain
of definition contracted to $\mathscr{F}_n$. Let $x_n$ be the density of $\varphi_n$ relative to $P_n$. Andersen
and Jessen [1, 1946] proved that $\lim x_n = x_\infty$ with probability 1. Since, under the
present hypotheses,
\[ \varphi_n(\Lambda) = \int_{\Lambda} x_n\, dP = \varphi_\infty(\Lambda) = \int_{\Lambda} x_\infty\, dP, \qquad \Lambda \in \mathscr{F}_n , \]
we have
\[ x_n = E\{x_\infty \mid \mathscr{F}_n\} \]
with probability 1. Thus the Andersen-Jessen result becomes a special case of Theorem
4.3, according to which
\[ \lim_{n \to \infty} E\{x_\infty \mid \mathscr{F}_n\} = E\{x_\infty \mid \mathscr{F}_\infty\} \]
with probability 1. More generally, Andersen and Jessen proved the existence of the
limit $x_\infty$ under the assumption that each $\varphi_n$ is absolutely continuous, but that $\varphi_\infty$ is
completely additive, without necessarily being absolutely continuous. The limit is
then the density of the absolutely continuous component of $\varphi_\infty$. In this case,
\[ \varphi_m(\Lambda) = \int_{\Lambda} x_m\, dP = \varphi_n(\Lambda) = \int_{\Lambda} x_n\, dP, \qquad m < n, \quad \Lambda \in \mathscr{F}_m , \]
so that the $x_n$ process is a martingale. Moreover, if $K$ is the variation of $\varphi_\infty$, and $K_n$
that of $\varphi_n$, we have
\[ K_1 \le K_2 \le \cdots \le K, \qquad E\{|x_n|\} = K_n , \]
so that the existence of the limit $x_\infty$ is a consequence of Theorem 4.1. In a later paper
[3, 1948], Andersen and Jessen assumed only that $\varphi_\infty$ was completely additive, dropping
the assumption of absolute continuity of the $\varphi_n$'s. They defined $x_n$ as the density of
the absolutely continuous component of $\varphi_n$, $n \le \infty$, and proved that in this case also
$\lim_{n \to \infty} x_n$ exists and is finite with probability 1. To put this in the frame of martingale
theory, suppose that $\varphi_\infty$ is non-negative. (If this hypothesis is not true, $\varphi_\infty$ can be
expressed as the difference between two such $\varphi_\infty$'s.) Suppose that $m < n$, and that
$\Lambda \in \mathscr{F}_m$. Then, if the singular component of $\varphi_m$ vanishes on $\Lambda$,
\[ \varphi_m(\Lambda) = \int_{\Lambda} x_m\, dP = \varphi_n(\Lambda) \ge \int_{\Lambda} x_n\, dP . \]
Since the singular set of $\varphi_m$ has probability 0, its introduction does not change the above
integrals, and we have, with no restriction on the relation of $\Lambda$ to the singular set of $\varphi_m$,
\[ \int_{\Lambda} x_m\, dP \ge \int_{\Lambda} x_n\, dP, \qquad \Lambda \in \mathscr{F}_m, \quad m < n, \]
that is,
\[ x_m \ge E\{x_n \mid \mathscr{F}_m\}, \qquad m < n, \]
with probability 1. In other words, the sequence $\{-x_n, \mathscr{F}_n, n \ge 1\}$ is a semi-martingale
whose random variables are non-positive. We now conclude from Theorem 4.1s that
$\lim_{n \to \infty} x_n$ exists and is finite with probability 1. We omit the discussion of the
Andersen-Jessen identification of $x_\infty$ with the density of the absolutely continuous component of
$\varphi_\infty$, but see §8, where such an identification is made. The discussion of derivatives in
§8 is a special case of the work of Andersen and Jessen since a $\varphi_\infty$ is given in each case.
The specialization to derivatives with respect to nets in §8 was made in view of the
applications to be made elsewhere in the book, and to avoid unnecessary abstraction
in examples of martingale methodology. The situation treated by Andersen and
Jessen is less general than that treated in §4 in that Andersen and Jessen always assume
the existence of a completely additive $\varphi_\infty$, whereas we have seen above that martingale
theory treats cases in which the martingale is not derived from such a set function.
On the other hand, Andersen and Jessen make a more complete identification of $x_\infty$
with the density of the absolutely continuous component of $\varphi_\infty$, obtaining thereby an
explicit representation of the singular set of $\varphi_\infty$.


We omit the case, also treated by Andersen and Jessen, in which the sequence of
fields $\{\mathscr{F}_n\}$ is monotone non-increasing. In this case Andersen and Jessen obtain
what are essentially the corresponding convergence theorems of §4 in a different
language.

§5 

The derivation of the zero-one law from martingale theory is due to Lévy [5, 1937]. 

The suggested application of martingale theory to prove many of the standard
convergence theorems about infinite series of mutually independent random variables
appears to be new.

Theorem 5.1 (and other results in this same area, proved by different methods) can
be found in Marcinkiewicz and Zygmund [1, 1937], Marcinkiewicz [1, 1937]. The
methods used in the text are taken from Doob [6, 1940].


§6 
The application of martingale theory to derive the strong law of large numbers for 
mutually independent random variables with a common distribution function is taken 
from Doob [11, 1949]. 


§7 
The theorems in this section are due to Jessen [1, 1934]. 


§8 
The theorems of this section are easily derived from general theorems on derivatives 
of set functions due to de Possel [1, 1935]. They are implicit in Andersen and Jessen 
[1, 1946]. See the discussion of the Andersen-Jessen papers in the notes to §4.


§9 

The application in §9 is given merely to clarify the significance of the likelihood 
ratio. See also Doob [13, 1951]. A deeper examination of the theory of statistical 
estimation from the point of view of martingales was given in Doob [11, 1949]. 

The consistency of the method of maximum likelihood, that is, the asymptotic 
correctness of the maximum likelihood estimates of the true parameter of a distribution 
in terms of a finite sample, was proved by Wald [2, 1949]. An earlier proof by Doob 
[1, 1934] was marred by an overenthusiastic use of the strong law of large numbers,
but the slip is easily rectified, as noted by Doob in the Wald reference, to give the 
same results as Wald’s method. 

§10 


The fundamental theorem of sequential analysis is due to Wald [1, 1944]. See
Blackwell and Girshick [1, 1946] for another approach to this theorem in terms of 
martingale theory. 

§11

The theorems in this section are new, except as indicated below. Some are, however, 
trivial generalizations of discrete parameter theorems.

In Doob [5, 1940] it was proved that the sample functions of a separable martingale 
with parameter set an interval almost all have left-hand limits at all points. In Doob 
[13, 1951] this result was strengthened to give Theorem 11.5 in the martingale case. 


The treatment of the strong law of large numbers for processes with stationary 
independent increments is taken from Doob [6, 1940]. More precise results, involving 
upper limiting functions for $x_t$ in (11.3) for $t \to \infty$, were obtained by Gnedenko
[2, 1943]. 

Theorem 11.9 was stated with an indication of a (different) proof by Lévy [7, 1948,
p. 78]. Lévy’s statement is somewhat more general, but is easily reduced to Theorem
11.9. 

For the central limit theorem for martingales see Lévy [3, 1935; 5, 1937]. 


§12 
For further applications of martingale theory to the continuity of sample functions 
of Markov processes see Doob [7, 1942; 13, 1951]. 


CHAPTER VIII 


§1 
The study of processes with independent increments was initiated by de Finetti 
[1, 1929]. See Lévy’s books [5, 1937; 7, 1948] for detailed treatments of these
processes.


§2
The first rigorous study of the Brownian motion process was made by Wiener 
[1, 1923]. However, Bachelier [1, 1900] had already discovered many of the properties
of this process. See Lévy’s books [5, 1937; 7, 1948] for deep studies of the Brownian 
motion process and further references. Theorem 2.1 is due to Bachelier [1, 1900]. 
Theorem 2.2 is due to Wiener [1, 1923]. Theorem 2.3 is due to Lévy [6, 1940]. 


§3 
For Einstein's work on the Brownian motion process see Einstein [1, 1906], and see 
Barnes and Silverman [1, 1934] for a discussion of the significance of this process in 
physical measurements. 


§4, §5
The application of the Poisson process in §5 is new.


§6, §7 

The centering of the general process of independent increments is due to Lévy 
[2, 1934], and §6 presumably carries out his ideas in somewhat more detail than he
has given. 

Theorem 7.1 is due to Lévy [2, 1934; 5, 1937], who proves also that (ii) (b) is implied 
by (ii) (a) even if there are fixed points of discontinuity. 

Theorem 7.2 is due to Lévy [2, 1934]. See Lévy [5, 1937] for a detailed discussion 
of the significance of the Lévy formula (7.2) in terms of the sample function properties. 
Ito [1, 1942] has expressed the general process with independent increments as a kind 
of generalized integral of Poisson processes, thus exhibiting this relationship in an 
elegant way. 


CHAPTER IX 


§1, §2 
Stochastic integrals of the type discussed in §2 were first discussed by Wiener, having 
been introduced by him, in a somewhat indirect form, in Wiener [1, 1923]. Such 
integrals are now a commonplace in Hilbert space discussions, in a somewhat different 
appearing form. For example, if, for each real number $t$, $\hat{E}(t)$ is a projection operator
acting on the $L_2$ space of measurable functions whose squares are integrable, projecting
the space on the closed linear manifold $\mathfrak{M}_t$, if $x$ is an element of $L_2$, and if $\mathfrak{M}_s \subset \mathfrak{M}_t$
when $s < t$, for example if the family of projections is a canonical resolution of the
identity (see Stone [2, 1932]), then the family of elements of $L_2$ $\{\hat{E}(t)x, -\infty < t < \infty\}$
is a stochastic process (assuming that the basic measure is a probability measure) with 

orthogonal increments. Integrals of the form 


\[ \int_{-\infty}^{\infty} \Phi(t)\, d\hat{E}(t)x \]


are standard tools in the Hilbert space operational calculus. (See the notes to X §3,
and Stone [2, 1932].)
For a very general approach to stochastic integrals see Bochner [2, 1942]. 


§3 
This section is an adaptation of part of a paper by Khintchine [5, 1938], chosen for 
its general significance. See also Blanc-Lapierre [1, 1945] for more work in this same 
direction.


§5
The stochastic integral in §5 is a generalization of one defined by Ito [2, 1944], who 
treated the case in which the y(t) process is the Brownian motion process. The use 
of martingale theory makes it possible to construct a closed system of these stochastic 
integrals, so that the integral with a variable upper limit defines a process of the same 
type as the process providing the original differential element. 


CHAPTER X 


§1

The discussion of measure-preserving transformations and the corresponding wide 
sense discussion are more general than usual in order to make it obvious that the theory 
of these transformations is exactly the same as that of stationary stochastic processes, 
although the two may not appear to be more than formally similar when stated at 
different levels of generality. For background on the subject see Hopf [1, 1937] and 
Halmos [2, 1949]. For general treatments (in probability language) of processes 
stationary in the wide sense see Cramér [2, 1940], Doob [12, 1949], Karhunen [1, 1946; 
2, 1947], Lévy [7, 1948], Loève [1, 1945; 4, 1946], Maruyama [1, 1949], Slutsky
[3, 1938], Wold [1, 1938]. In the following historical remarks no distinction is usually 
made between discrete and continuous parameter processes, or between real and 
complex cases where a proof in one case can be paraphrased into a proof for the other. 

Theorem 1.1 is due to Doob [4, 1938]. 




§2 
The ergodic theorem (Theorem 2.1) is due to G. D. Birkhoff [1, 1931]. The proof 
given here is taken, with insignificant modifications, from F. Riesz [1, 1945]. 


§3, §4 

If $\{x_n, -\infty < n < \infty\}$ is a process which is stationary in the wide sense, it was
shown in §1 that we can write $x_n = U^n x_0$, where $U$ is unitary, and that conversely such
a formula always defines a process stationary in the wide sense. (Here we always take
a unitary transformation as one defined on a specific space of integrable squared functions,
rather than one defined on an abstract Hilbert space.) Now von Neumann
[1, 1929] and Wintner [1, 1929] proved that, if $U$ is a unitary transformation, with
domain $\mathfrak{M}$, there is for each $\lambda$ in the interval $[-\tfrac12, \tfrac12]$ a closed linear manifold
$\mathfrak{M}(\lambda) \subset \mathfrak{M}$ such that:

(a) $\mathfrak{M}(-\tfrac12)$ is the manifold containing only the random variables vanishing almost
everywhere, and $\mathfrak{M}(\tfrac12) = \mathfrak{M}$;

(b) $\mathfrak{M}(\lambda) \subset \mathfrak{M}(\mu)$ if $\lambda < \mu$;

(c) $\mathfrak{M}(\lambda) = \bigcap_{\mu > \lambda} \mathfrak{M}(\mu)$, $-\tfrac12 \le \lambda < \tfrac12$,

and that, if $\hat{E}(\lambda)$ is the projection of $\mathfrak{M}$ on $\mathfrak{M}(\lambda)$, then, for any $x, y \in \mathfrak{M}$,

(i) $$\mathbf{E}\{(U^n x)\bar{y}\} = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, d\mathbf{E}\{[\hat{E}(\lambda)x]\bar{y}\}.$$

The above properties of the projection operators imply that the process
$\{\hat{E}(\lambda)x, -\tfrac12 \le \lambda \le \tfrac12\}$ is a process with orthogonal increments, and that $\mathbf{E}\{[\hat{E}(\lambda)x]\bar{x}\}$
is real and monotone non-decreasing in $\lambda$. Thus, if $x = y$, (i) becomes another version
of the expression (3.2) for $R(n)$. The equation (i) can be written in other ways, depending
on the type of integral used, for example as

(ii) $$U^n x = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, d\hat{E}(\lambda)x$$

and as

(iii) $$U^n = \int_{-1/2}^{1/2} e^{2\pi i n\lambda}\, d\hat{E}(\lambda).$$

In the form (ii) the von Neumann-Wintner representation becomes the spectral representation
of a discrete parameter stationary process (wide sense), Theorem 4.1. We
stress again that the probability situation is neither more nor less general than the
Hilbert space situation (aside from the accident that in the probability discussion the
measure space has measure 1). The two are the same, but in different language and
with different emphasis. The reader is referred to Stone [2, 1932] for further details
of the Hilbert space arguments. The proof of Theorem 4.1 given in the text follows
a general method of Cramér [4, 1951].

The basic properties of wide sense stationary processes were given, in the continuous
parameter case, by Khintchine [2, 1934], at a time when it was not yet clear that the 
theory was that of unitary translation groups in a different context. His work was 
translated into the discrete parameter case by Wold [1, 1938]. Khintchine proved 
Theorem 3.1 in the continuous parameter case, and then used the continuous parameter 
version of Theorem 3.2 to express the covariance function as a Fourier-Stieltjes transform.
(See XI §3, §4.) The spectral representation theorem, Theorem 4.1, was
published first by Cramér [3, 1942], but was apparently independently discovered at
about the same time by Loève. See Lévy [7, 1948, pp. 123, 298] for a discussion of
this matter. The Russian school had, however, already discovered the identity of the
probability and Hilbert space problems. For example, Obukhoff [1, 1941] uses the 
spectral representation of Theorem 4.1 explicitly, referring as justification to earlier 
Hilbert space work of Kolmogorov (to be taken up in XI), who had mentioned the 
probability interpretation. 

Many of the theorems on wide sense stationary processes, such as the representation 
of a covariance function as a Fourier-Stieltjes transform, are closely related to corre- 
sponding theorems on the harmonic analysis of individual functions, for which see 
Wiener [2, 1930]. 

Theorem 3.2 is due to Herglotz [1, 1911].


§6 
Theorem 6.1, the law of large numbers for stationary processes (wide sense), also 
called the $L_2$ ergodic theorem, is due to von Neumann [3, 1932] in the language of
Hilbert space transformations, to Khintchine [2, 1934] in the language of probability 
(both in the continuous parameter case). To follow through the full parallelism 
between strict and wide sense theorems, it would have been necessary to prove that, 
if $U$ is isometric,

$$\operatorname*{l.i.m.}_{n\to\infty} \frac{1}{n+1}\sum_{m=0}^{n} U^m x$$

exists for all $x$, and is the projection of $x$ on the manifold of functions invariant under
U. This theorem was omitted because its proof would have taken the discussion too 
far afield. 

Theorem 6.2 is due to Loève [1, 1945] (continuous parameter case), and was later
proved independently by Blanc-Lapierre and Brard [1, 1946]. 


§7 
Theorem 7.1 is new, but see also related work of Grenander [1, 1951] and Grenander 
and Rosenblatt [1, 1952]. Note that, according to Maruyama [1, 1949], a real
stationary Gaussian process with zero means is metrically transitive if and only if its 
spectral distribution function is continuous. 


§8, §9, §10


The material in these sections was obtained more or less independently by many 
authors. See the general references already given. 


CHAPTER XI 
(See also the notes to the corresponding sections of X.) 


§1-§4 
The continuous parameter theorem corresponding to that of von Neumann and
Wintner on the form of the iterates of a unitary transformation is the following. Let
$\{U_t, -\infty < t < \infty\}$ be a family of unitary transformations with domain $\mathfrak{M}$, for which
$U_{s+t} = U_s U_t$, $-\infty < s, t < \infty$. Then (if an added continuity condition described
below is imposed) there is for each real number $\lambda$ a closed linear manifold $\mathfrak{M}(\lambda) \subset \mathfrak{M}$
such that

(a) $\bigcap_{\lambda} \mathfrak{M}(\lambda)$ is the manifold containing only the random variables which vanish
almost everywhere, and $\bigcup_{\lambda} \mathfrak{M}(\lambda)$ is dense in $\mathfrak{M}$;

(b) $\mathfrak{M}(\lambda) \subset \mathfrak{M}(\mu)$ if $\lambda < \mu$;

(c) $\mathfrak{M}(\lambda) = \bigcap_{\mu > \lambda} \mathfrak{M}(\mu)$, $-\infty < \lambda < \infty$;

and that, if $\hat{E}(\lambda)$ is the projection of $\mathfrak{M}$ on $\mathfrak{M}(\lambda)$, then, for any $x, y \in \mathfrak{M}$,

(i) $$\mathbf{E}\{(U_t x)\bar{y}\} = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\, d\mathbf{E}\{[\hat{E}(\lambda)x]\bar{y}\}$$

or, in alternative versions,

(ii) $$U_t x = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\, d\hat{E}(\lambda)x,$$

(iii) $$U_t = \int_{-\infty}^{\infty} e^{2\pi i t\lambda}\, d\hat{E}(\lambda).$$

The first version, with $y = x$, yields the representation (3.2) of the covariance function
as a Fourier-Stieltjes transform, the second yields the spectral representation (4.1) of
a continuous parameter (wide sense) stationary process. The above result on unitary
groups is due to Stone [1, 1930], under the hypothesis that, for all $x, y$, $\mathbf{E}\{(U_t x)\bar{y}\}$
defines a continuous function of $t$. Von Neumann [2, 1932] proved that the measurability
of this $t$ function (for all $x, y$) implies its continuity if $\mathfrak{M}$ is separable.
Theorem 3.2 is due to Bochner [1, 1932].


§8 
The fact that, if $\{x(t), -\infty < t < \infty\}$ is a Brownian motion process, $x'(t)$ is a
(fictitious) stationary process with constant spectral density was first stressed by Wiener
(see [2, 1930], as well as earlier papers).
§9 
The French school describes linear operations as discussed in §9 as filters. 


§11


The results of this section, aside from the part on stochastic integrals, are due to 
Kolmogorov [9, 1940; 10, 1940]. See also von Neumann and Schönberg [1, 1941]. 


CHAPTER XII 


§1-§5 
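If $F$ is monotone non-decreasing in the interval $[-\tfrac12, \tfrac12]$, Szegö [1, 1920] proved that

$$\lim_{n\to\infty}\ \min_{b_0,\dots,b_{n-1}} \int_{-1/2}^{1/2} \Big| e^{2\pi i n\lambda} - \sum_{j=0}^{n-1} b_j e^{2\pi i j\lambda} \Big|^p \, dF(\lambda) = \exp\Big\{ \int_{-1/2}^{1/2} \log F'(\lambda)\, d\lambda \Big\},$$

with the obvious convention when the integral on the right is $-\infty$. Szegö treated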
this as an ordinary problem of polynomial approximation, proving the stated result 
for all p > 0, under the hypothesis that F was absolutely continuous. Wold [1, 1938] 
proved the fundamental decomposition result, Theorem 4.2. Kolmogorov [8, 1939; 
11, 1941; 12, 1941] put the Wold decomposition theorem in its analytic setting, 
obtaining Theorems 4.1 and 4.3. The Kolmogorov results generalize the Szegö
theorem, for the case $p = 2$, to arbitrary monotone $F$. The limit in Szegö's theorem
is the mean square prediction error with lag 1. (See the general discussion of prediction
theory in §1.) Wiener [3, 1942] obtained Kolmogorov's results for absolutely continuous
$F$, independently, and solved the corresponding continuous parameter prediction
problem, stressing the practical problem of finding the prediction explicitly in
a useful form for electrical engineering. Krein [1, 1944; 2, 1945; 3, 1945] treated the
discrete and continuous parameter problems in a somewhat more general form, with
arbitrary $F$. Hanner [1, 1949] treated the continuous parameter problem in a more
probabilistic manner than his predecessors, without the use of the spectral representation
theorem, and was the first to obtain the continuous parameter analogue of the
Wold decomposition theorem. Karhunen [3, 1950] obtained these results using the
spectral representation theorem. Ahiezer [1, 1947] treated the Szegö problem with
arbitrary $F$ and $p > 1$, proving that the indicated limit is 0 if and only if the logarithmic
integral is $-\infty$, proving also the corresponding result in the continuous parameter
case. Loève ([3, 1946] and a section by Loève in Lévy [7, 1948]) obtained a version
of the Wold decomposition theorem for non-stationary processes.
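
The logarithmic integral in Szegö's theorem can be checked numerically in a simple case. The following is a minimal sketch, not taken from any of the works cited: it assumes the moving average spectral density $F'(\lambda) = \sigma^2 |1 + b e^{2\pi i \lambda}|^2$ with $|b| < 1$, for which the lag 1 mean square prediction error is known to be $\sigma^2$, and it approximates the integral by the midpoint rule. The names `ma1_density` and `szego_limit` and the parameter values are hypothetical, chosen only for illustration.

```python
import cmath
import math


def ma1_density(lam, b, sigma2):
    """Assumed example density: sigma2 * |1 + b * exp(2*pi*i*lam)|^2."""
    return sigma2 * abs(1.0 + b * cmath.exp(2j * math.pi * lam)) ** 2


def szego_limit(density, n_points=4096):
    """Approximate exp(integral of log density over [-1/2, 1/2]) by the midpoint rule."""
    step = 1.0 / n_points
    total = 0.0
    for k in range(n_points):
        lam = -0.5 + (k + 0.5) * step  # midpoint of the k-th subinterval
        total += math.log(density(lam))
    return math.exp(total * step)


# Illustrative parameter values (assumptions, not from the text).
sigma2, b = 2.0, 0.5
limit = szego_limit(lambda lam: ma1_density(lam, b, sigma2))
print(limit)  # should agree with sigma2 up to rounding error
```

The midpoint rule is used here only because the integrand is smooth and periodic, so equally spaced points converge very quickly; a density with zeros on the interval would need more care.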


SUPPLEMENT 


The reader is referred to Halmos [3, 1950] for general measure theoretic background
material, and the proofs omitted in the Supplement. 


§2 

(Examples 2.3, 2.6.) The fact that finite dimensional measures of Borel sets can be
extended to infinite dimensional measures as described in Example 2.3 is due to Daniell 
[1, 1918-1919; 2, 1919-1920] and Kolmogorov [5, 1933]. A proof that the theorem 
is applicable even if the factor spaces are abstract appeared in Doob [4, 1938]. The 
latter result is, however, incorrect in general. See the counterexample in Andersen 
and Jessen [2, 1948] or Halmos [3, 1950, §49]. The first proof that this result is correct
at least in the case of independent factor measures was given by von Neumann [4, 1935] 
(see also Andersen and Jessen [2, 1948]). The fact that the result is correct whenever 
there are conditional probability distributions, the essential hypothesis of the text, is 
due to Ionescu Tulcea [1, 1949]. 




BIBLIOGRAPHY 


N. I. AHIEZER 
[1] Lectures on the theory of approximation. Moscow-Leningrad, 1947 (Russian).


WARREN AMBROSE 
[1] On measurable stochastic processes. Trans. Am. Math. Soc. 47, 66-79 (1940). 


ERIK SPARRE ANDERSEN, BØRGE JESSEN 
[1] Some limit theorems on integrals in an abstract set. Danske Vid. Selsk. Mat.-Fys.
Medd. 22, no. 14, 29 pp. (1946).
[2] On the introduction of measures in infinite product sets. Danske Vid. Selsk.
Mat.-Fys. Medd. 25, no. 4, 8 pp. (1948).
[3] Some limit theorems on set-functions. Danske Vid. Selsk. Mat.-Fys. Medd. 25,
no. 5, 8 pp. (1948).


L. BACHELIER 
[1] Théorie de la speculation. Ann. Sci. École Norm. Sup. (3), 21-86 (1900).


R. B. BARNES, S. SILVERMAN 


[1] Brownian motion as a natural limit to all measuring processes. Revs. Modern 
Phys. 6, 162-192 (1934). 


M. BEBOUTOFF 
[1] Markoff chains with a compact state space. Rec. Math. (Mat. Sbornik) N.S.,
10 (52), 213-238 (1942). 


SERGE BERNSTEIN 
[1] Sur l'extension du théorème limite du calcul des probabilités aux sommes de
quantités dépendantes. Math. Ann. 97, 1-59 (1927).
[2] Equations differentielles stochastiques. Actualités Sci. Ind. 738, 5-31 (1938).


GEORGE D. BIRKHOFF
[1] Proof of the ergodic theorem. Proc. Natl. Acad. Sci. U.S.A. 17, 656-660 (1931). 


D. BLACKWELL, M. A. GIRSHICK
[1] On functions of sequences of independent chance vectors with applications to
the problem of the “random walk” in k dimensions. Ann. Math. Statistics 17,
310-317 (1946). 


A. BLANC-LAPIERRE 
[1] Sur certaines fonctions aléatoires stationnaires. Applications à l'étude des
fluctuations due à la structure de l'électricité. Thesis, Université de Paris, 1945,
80 pp.


A. BLANC-LAPIERRE, R. BRARD 


[1] Les fonctions aléatoires stationnaires et la loi des grands nombres. Bull. Soc. 
Math. France 74, 102-115 (1946). 


S. BOCHNER


[1] Fouriersche Integrale. Leipzig 1932. 
[2] Stochastic processes. Ann. Math. 48, 1014-1061 (1942). 


K. L. CHUNG, W. H. J. FUCHS


[1] On the distribution of values of sums of random variables. Mem. Am. Math. 
Soc. no. 6, 12 pp. (1951). 


H. CRAMÉR 


[1] Random variables and probability distributions. Cambridge Tracts in Math. 
no. 36 (1937). 

[2] On the theory of stationary random processes. Ann. Math. 41, 215-230 (1940). 

[3] On harmonic analysis in certain functional spaces. Ark. Mat. Astr. Fys. 28B, 
no. 12, 17 pp. (1942). 

[4] A contribution to the theory of stochastic processes. Proc. Sec. Berkeley Symp. 
Math. Statistics and Prob. Berkeley 1951, 329-339. 


P. J. DANIELL 
[1] Integrals in an infinite number of dimensions. Ann. Math. (2) 20, 281-288 
(1918-1919). 
[2] Functions of limited variation in an infinite number of dimensions. Ann. Math. 
(2) 21, 30-38 (1919-1920). 


W. DOEBLIN 


[1] Sur les propriétés asymptotiques de mouvement régis par certains types de 
chaines simples. Bull. Math. Soc. Roum. Sci. 39, no. 1, 57-115; no. 2, 3-61 (1937). 

[2] Sur les sommes d'un grand nombre de variables aléatoires indépendantes. Bull. 
Sci. Math. 63, 23-64 (1939).

[3] Sur certains mouvements aléatoires discontinus. Skand. Aktuarietidskr. 22, 
211-222 (1939). 

[4] Sur l'équation matricielle $A^{(t+s)} = A^{(t)}A^{(s)}$ et ses applications aux probabilités
en chaine. Bull. Sci. Math. (2), 62, 21-32 (1938); 64, 35-37 (1940).

[5] Éléments d'une théorie générale des chaines simple constantes de Markoff. 
Ann. Sci. École Norm. Sup. (3) 57, 61-111 (1940). 


J. L. DOOB


[1] Probability and statistics. Trans. Am. Math. Soc. 36, 759-775 (1934).

[2] Note on probability. Ann. Math. 37, 363-367 (1936).

[3] Stochastic processes depending on a continuous parameter. Trans. Am. Math. 
Soc. 42, 107-140 (1937). 

[4] Stochastic processes with an integral-valued parameter. Trans. Am. Math. Soc. 
44, 87-150 (1938). 

[5] Regularity properties of certain families of chance variables. Trans. Am. Math. 
Soc. 47, 455-486 (1940).


[6] The law of large numbers for continuous stochastic processes. Duke Math. J.
6, 290-306 (1940).

[7] Topics in the theory of Markoff chains. Trans. Am. Math. Soc. 52, 37-64 (1942).

[8] Markoff chains—denumerable case. Trans. Am. Math. Soc. 58, 455-473 (1945). 

[9] Probability in function space. Bull. Am. Math. Soc. 53, 15-30 (1947). 

[10] Asymptotic properties of Markoff transition probabilities. Trans. Am. Math. 
Soc. 63, 393-421 (1948). 

[11] Application of the theory of martingales. Le Calcul des Probabilités et ses 
Applications, Colloques Internationaux du Centre National de la Recherche 
Scientifique, Paris 1949, 23-27. 

[12] Time series and harmonic analysis. Proc. Berkeley Symp. Math. Statistics and 
Prob. Berkeley 1949, 303-343.

[13] Continuous parameter martingales. Proc. Sec. Berkeley Symp. Math. Statistics
and Prob. Berkeley 1951, 269-277.


J. L. DOOB, WARREN AMBROSE


[1] On two formulations of the theory of stochastic processes depending upon a 
continuous parameter. Ann. Math. 41, 731-745 (1940). 


A. EINSTEIN 
[1] Zur Theorie der Brownschen Bewegung. Ann. Phys. IV 19, 371-381 (1906). 


WILLIAM FELLER 


[1] Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung. Math. Z. 
40, 521-559 (1935); II 42, 301-312 (1937). 

[2] Zur Theorie der stochastischen Prozesse (Existenz und Eindeutigkeitssätze). 
Math. Ann. 113, 113-160 (1936).

[3] On the Kolmogoroff-P. Lévy formula for infinitely divisible distribution functions. 
Proc. Yugoslav Acad. Sci. 82, 95-112 (1937).

[4] Über das Gesetz der grossen Zahlen. Acta Univ. Szeged 8, 191-201 (1937).

[5] On the integro-differential equations of purely discontinuous Markoff processes. 
Trans. Am. Math. Soc. 48, 488-515 (1940). Errata. Ibid. 58, 474 (1945). 

[6] An introduction to probability theory and its applications. Vol. 1. New York 
1950. 


BRUNO DE FINETTI 


[1] Sulle funzioni a incremento aleatorio. Rend. Accad. Naz. Lincei Cl. Sci. Fis. 
Mat. Nat. (6) 10, 163-168 (1929). 


ROBERT FORTET 


[1] Les fonctions aléatoires du type de Markoff associées a certaines equations 
linéaires aux derivées partielles du type paraboliques. J. Math. Pures Appl. 22, 
177-243 (1943). 


MAURICE FRÉCHET


[1] Sur l'allure asymptotique de la suite des itérés d'un noyau de Fredholm. Quart.
J. Math. Oxford Ser. 5, 106-144 (1934).

[2] Recherches théoriques modernes sur le calcul des probabilités II. Méthode des
fonctions arbitraires. Théorie des événements en chaine dans le cas d'un nombre
fini d'états possibles. Paris 1938.




B. V. GNEDENKO 
[1] On the theory of limit theorems for sums of independent random variables. 
(Russian) Bull. Acad. Sci. U.R.S.S. 181-232, 643-647 (1939). 
[2] Sur la croissance des processus stochastiques homogènes à accroissements 
indépendants. (Russian) Izvestiya Akad. Nauk S.S.S.R. 7, 89-110 (1943). 
[3] Limit theorems for sums of independent random variables. (Russian) Uspehi 
Matem. Nauk 10, 115-165 (1944). 


B. V. GNEDENKO, A. KOLMOGOROV 


[1] Limit distributions for sums of independent random variables. (Russian) 
Moscow-Leningrad (1949). 


PAUL R. HALMOS
[1] Invariants of certain stochastic transformations: the mathematical theory of 
gambling systems. Duke Math. J. 5, 461-478 (1939). 
[2] Measurable transformations. Bull. Am. Math. Soc. 55, 1015-1034 (1949). 
[3] Measure theory. New York 1950. 
OLAF HANNER 
[1] Deterministic and non-deterministic stationary random processes. Ark. Mat. 1, 
161-177 (1949). 
G. HERGLOTZ 
[1] Über Potenzreihen mit positivem reellen Teil im Einheitskreis. Ber. Verh. Kgl.
Sachs. Ges. Wiss. Leipzig Math.-Phys. Kl. 63, 501-511 (1911). 
EINAR HILLE 
[1] Functional analysis and semi-groups. Am. Math. Soc. Coll. Publ. Vol. 31 1948. 


W. HÖFFDING, H. ROBBINS
[1] The central limit theorem for dependent random variables. Duke Math. J. 15,
773-780 (1948).
EBERHARD HOPF
[1] Ergodentheorie. Erg. Math. 5, no. 2 (1937).


B. HOSTINSKY
[1] Méthodes générales du calcul des probabilités. Mém. Sci. Math. 52, 1931.


C. T. IONESCU TULCEA


[1] Mesures dans les espaces produits. Atti Accad. Naz. Lincei Rend. Cl. Sci. Fis.
Mat. Nat. (8) 7 (1949), 208-211 (1950).


KIYOSI ITO


[1] On stochastic processes (I) (Infinitely divisible laws of probability). Jap. J.
Math. 18, 261-301 (1942).

[2] Stochastic integral. Proc. Imp. Acad. Tokyo 20, 519-524 (1944).

[3] On a stochastic integral equation. Proc. Jap. Acad. nos. 1-4, 32-35 (1946).

[4] On stochastic differential equations. Mem. Am. Math. Soc. 4, 51 pp. (1951).




BØRGE JESSEN 
[1] The theory of integration in a space of an infinite number of dimensions. Acta 
Math. 63, 249-323 (1934).


BØRGE JESSEN, AUREL WINTNER 
[1] Distribution functions and the Riemann zeta function. Trans. Am. Math. Soc. 
38, 48-88 (1935). 


STEFAN KACZMARZ, HUGO STEINHAUS
[1] Theorie der Orthogonalreihen. Warsaw-Lwow (1935). 


SHIZUO KAKUTANI 
[1] Ergodic theorems and the Markoff process with a stable distribution. Proc. 
Imp. Acad. Tokyo 16, 49-54 (1940). 


E. R. VAN KAMPEN 
[1] Infinite product measures and infinite convolutions. Am. J. Math. 62, 417-448 
(1940). 


E. R. VAN KAMPEN, AUREL WINTNER 
[1] On divergent infinite convolutions. Am. J. Math. 59, 635-654 (1937). 


KARI KARHUNEN

[1] Zur Spektraltheorie stochastischer Prozesse. Ann. Acad. Sci. Fennicae Ser. A,
I. Math. Phys. 34, 7 pp. (1946).

[2] Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci.
Fennicae Ser. A, I. Math. Phys. 37, 79 pp. (1947).

[3] Über die Struktur stationärer zufälliger Funktionen. Ark. Mat. 1, 141-160
(1950).


TATSUO KAWATA
[1] The function of mean concentration of a chance variable. Duke Math. J. 8,
666-677 (1941).


TATSUO KAWATA, MASATOMO UDAGAWA
[1] On infinite convolutions. Kodai Math. Sem. Rep. no. 3, 15-22 (1949).


A. KHINTCHINE 

[1] Asymptotische Gesetze der Wahrscheinlichkeitsrechnung. Erg. Math. 4, 77 pp. 
(1933). 

[2] Korrelationstheorie der stationäre stochastischen Prozesse. Math. Ann. 109, 
604-615 (1934). 

[3] Zur Theorie der unbeschränkt teilbaren Verteilungsgesetze. Rec. Math. (Mat.
Sbornik) N.S. 2, 79-119 (1937).

[4] Limit laws for sums of independent random variables. (Russian) Moscow-
Leningrad 1938.

[5] Theorie der abklingende Spontaneffekte. (Russian, German summary) Izvestiya
Akad. Nauk S.S.S.R. Ser. mat. 3, 313-322 (1938). 


A. KHINTCHINE, A. KOLMOGOROV


[1] Über Konvergenz von Reihen, deren Glieder durch den Zufall bestimmt werden.
Rec. Math. (Mat. Sbornik) 32, 668-677 (1924).


A. KOLMOGOROV

[1] Über die Summen durch den Zufall bestimmter unabhängiger Grössen. Math.
Ann. 99, 309-319 (1928); Bemerkungen zu meiner Arbeit “Über die Summen
zufälliger Grössen.” Math. Ann. 102, 484-488 (1930).

[2] Sur la loi forte des grands nombres. C. R. Acad. Sci. Paris 191, 910-912 (1930).

[3] Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Math.
Ann. 104, 415-458 (1931). 

[4] Sulla forma generale di una processo stocastico omogeneo. (Una problema di 
Bruno de Finetti.) Rend. R. Accad. Naz. Lincei Cl. Sci. Fis. Mat. Nat. 15 (6), 
805-808 (1932).

[5] Grundbegriffe der Wahrscheinlichkeitsrechnung. Erg. Mat. 2, no. 3 (1933). 

[6] Anfangsgründe der Markoffschen Ketten mit unendlich vielen möglichen 
Zuständen. Rec. Math. Moscou (Mat. Sbornik) 1 (43), 607-610 (1936).

[7] Markov chains with a countable number of possible states. (Russian) Bull. 
Math. Univ. Moscou 1, no. 3, 16 pp. (1937). 

[8] Sur l'interpolation et extrapolation des suites stationnaires. C. R. Acad. Sci. 
Paris 208, 2043-2045 (1939). 

[9] Kurven in Hilbertschen Raum die gegenüber eine einparametrigen Gruppe von 
Bewegungen invariant sind. C. R. (Doklady) Acad. Sci. U.R.S.S. (N.S.) 26, 6-9 
(1940). 

[10] Wienersche Spiralen und einige andere interessante Kurven im Hilbertschen 
Raum. C. R. (Doklady) Acad. Sci. U.R.S.S. (N.S.) 26, 115-118 (1940).

[11] Stationary sequences in Hilbert space. (Russian) Bull. Math. Univ. Moscou 2, 
no. 6, 40 pp. (1941). 

[12] Interpolation und Extrapolation von stationären zufälligen Folgen. Bull. Acad. 
Sci. U.R.S.S. Ser. Math. 5, 3-14 (1941). 


M. KREIN

[1] On the problem of continuation of helical arcs in Hilbert space. (Russian) C. R. 
(Doklady) Acad. Sci. U.R.S.S. 45, 139-142 (1944). 

[2] On a generalization of some investigations of G. Szegö, V. Smirnoff, and A. 
Kolmogoroff. (Russian) C. R. (Doklady) Acad. Sci. U.R.S.S. (N.S.) 46, 91-94 
(1945). 

[3] On a problem of extrapolation of A. N. Kolmogoroff. (Russian) C. R. (Doklady)
Acad. Sci. U.R.S.S. (N.S.) 46, 306-309 (1945). 


N. M. KRYLOV, N. N. BOGOLIOUBOV
[1] Sur les propriétés ergodiques de l'équation de Smoluchowsky. Bull. Soc. Math. 
France 64, 49-56 (1936). 
[2] Sur les probabilités en chaine. C. R. Acad. Sci. Paris 204, 1386-1388 (1937). 
[3] Les propriétés ergodiques des suites de probabilités en chaine. C. R. Acad. Sci. 
Paris 204, 1454-1456 (1937). 


KIYONORI KUNISAWA


[1] On an analytical method in the theory of independent random variables. Ann. 
Inst. Statist. Math. Tokyo 1, 1-77 (1949). 


PAUL LÉVY


[1] Sur les séries dont les termes sont des variables éventuelles indépendantes.
Studia Math. 3, 119-155 (1931).

[2] Sur les intégrales dont les éléments sont des variables aléatoires indépendantes.
Ann. Scuola Norm. Sup. Pisa (2) 3, 337-366 (1934); Observation sur un précédent
mémoire de l’auteur. Ibid. 4, 217-218 (1935).

[3] Propriétés asymptotiques des sommes de variables aléatoires indépendantes ou
enchainées. J. Math. Pures Appl. Ser. 8 14, 347-402 (1935).

[4] Propriétés asymptotiques des sommes de variables aléatoires enchainées. Bull.
Sci. Math. (2) 59, 84-96, 109-128 (1935).

[5] Théorie de l'addition des variables aléatoires. Paris 1937.

[6] Le mouvement Brownien plan. Am. J. Math. 62, 487-550 (1940).

[7] Processus stochastiques et mouvement Brownien. Paris 1948.

[8] Systèmes markoviens et stationnaires. Cas dénombrable. Ann. Sci. Ecole
Norm. Sup. 68, 327-381 (1951).


J. W. LINDEBERG 


[1] Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung. 
Math. Zeitschr. 15, 211-225 (1922). 


MICHEL LOÈVE
[1] Sur les fonctions aléatoires stationnaires de second ordre. Rev. Sci. 83, 297-303
(1945).
[2] Etude asymptotique de sommes de variables aléatoires liées. J. Math. Pures
Appl. (9) 24, 249-318 (1945). 
[3] Quelques propriétés des fonctions aléatoires de second ordre. C. R. Acad. Sci. 
Paris 222, 469-470 (1946). 
[4] Fonctions aléatoires de second ordre. Rev. Sci. 84, 195-206 (1946). 


J. MARCINKIEWICZ

[1] Quelques théorèmes sur les fonctions indépendantes. Studia Math. 7, 104-120
(1937). 

[2] Sur les fonctions indépendantes I. Fund. Math. 30, 202-214 (1938). 

[3] Sur les fonctions indépendantes II. Fund. Math. 30, 349-364 (1938). 


J. MARCINKIEWICZ, A. ZYGMUND 
[1] Sur les fonctions indépendantes. Fund. Math. 29, 60-90 (1937). 


A. A. MARKOV

[1] Extension of the law of large numbers to dependent events. (Russian) Bull.
Soc. Phys. Math. Kazan (2) 15, 135-156 (1906).

[2] Calculus of probability. (Russian) 4th ed. Moscow 1924.


GISIRO MARUYAMA


[1] The harmonic analysis of stationary stochastic processes. Mem. Fac. Sci.
Kyusyu Univ. A 4, 45-106 (1949). 


JOHN VON NEUMANN
[1] Allgemeine Eigenwerttheorie Hermitischer Funktionaloperatoren. Math. Ann.
102, 49-131 (1929).
[2] Über einen Satz von Herrn M. H. Stone. Ann. Math. 33, 567-573 (1932).
[3] Proof of the quasi-ergodic hypothesis. Proc. Natl. Acad. Sci. U.S.A. 18, 70-82
(1932).
[4] Functional operators I. Measures and integrals. Ann. Math. Studies no. 21,
Princeton, New Jersey. (Reprint of a multigraphed edition of 1935.)


JOHN VON NEUMANN, I. J. SCHONBERG
[1] Fourier integrals and metric geometry. Trans. Am. Math. Soc. 50, 226-251
(1941).


A. OBUKHOFF
[1] On the energy distribution in the spectrum of a turbulent flow. C. R. (Doklady)
Acad. Sci. U.R.S.S. 32, 19-21 (1941).


BEDŘICH POSPÍŠIL
[1] Sur un problème de M. M. S. Bernstein et A. Kolmogoroff. Časopis Pěst. Mat.
Fys. 65, 64-76 (1935-36).


RENÉ DE POSSEL
[1] Sur la dérivation abstraite des fonctions d'ensemble. C. R. Acad. Sci. Paris 201,
579-581 (1935); J. Math. Pures Appl. 15, 391-409 (1936). 


F. RIESZ
[1] Sur la théorie ergodique. Comm. Math. Helvetici 17, 221-239 (1945). 


V. I. ROMANOVSKI 
[1] Discrete Markov chains. (Russian) Moscow-Leningrad 1949.


E. SLUTSKY 

[1] Sur les fonctions éventuelles continues, intégrables et dérivables dans le sens 
stochastique. C. R. Acad. Sci. Paris 187, 878-880 (1928). 

[2] Alcuni proposizioni sulla theoria degli funzioni aleatorie. Giorn. Ist. Ital. 
Attuari 8, 183-199 (1937). 

[3] Sur les fonctions aléatoires presques periodiques et sur la decomposition des 
fonctions aléatoires stationnaires en composantes. Actualités Sci. Ind. 738, 35-55 
(1938). 


MARSHALL HARVEY STONE 
[1] Linear transformations in Hilbert space III. Operational methods and group
theory. Proc. Natl. Acad. Sci. U.S.A. 16, 172-175 (1930).
[2] Linear transformations in Hilbert space and their applications to analysis. Am.
Math. Soc. Coll. Publ. 15 (1932).


G. SZEGÖ
[1] Beiträge zur Theorie der Toeplitzschen Formen. Math. Zeitschr. 6, 167-202 
(1920). 


JEAN VILLE 
[1] Étude critique de la notion de collectif. Paris 1939. 


A. WALD 
[1] On cumulative sums of random variables. Ann. Math. Statistics 15, 283-296
(1944).
[2] Note on the consistency of the maximum likelihood estimate. Ann. Math.
Statistics 20, 595-601 (1949).


NORBERT WIENER 
[1] Differential space. J. Math. Phys. Math. Inst. Tech. 2, 131-174 (1923). 
[2] Generalized harmonic analysis. Acta Math. 55, 117-258 (1930). 
[3] Extrapolation, interpolation, and smoothing of stationary time series. With
engineering applications. Cambridge-New York 1949. (Reprinted from a 
publication issued with restricted circulation in 1942.) 


AUREL WINTNER 
[1] Zur Theorie der beschränkten Bilinearformen. Math. Zeitschr. 30, 228-282
(1929).
[2] Asymptotic distributions and infinite convolutions. Princeton, 1938.
[3] The Fourier transforms of probability distributions. Baltimore 1947. 


H. WOLD
[1] A study in the analysis of stationary time series. Uppsala 1938. 


A. M. YAGLOM 


[1] The ergodic principle for Markov processes with stationary distributions.
(Russian) Doklady Akad. Nauk S.S.S.R. (N.S.) 56, 347-349 (1947). 


KÔSAKU YOSIDA
[1] The Markoff process with a stable distribution. Proc. Imp. Acad. Tokyo 16,
43-48 (1940).
[2] Simple Markoff process with a locally compact phase space. Math. Japonicae 1,
99-103 (1948).


KÔSAKU YOSIDA, SHIZUO KAKUTANI
[1] Markoff process with an enumerable infinite number of possible states. Jap. J.
Math. 16, 41-55 (1939).
[2] Operator theoretical treatment of Markoff’s process and mean ergodic theorem.
Ann. Math. 42, 188-228 (1941).


ANTONI ZYGMUND 
[1] Trigonometrical series. Warsaw-Lwow 1935. 


Index 


(This index includes the additional mathematical material, but not the names
or historical remarks, in the Appendix.)


Absolute centering constants 110. 

Absolute probabilities of a Markov 
process 172, 191, 214. 

Absolutely continuous set function 611. 

Absolutely continuous spectral distribu-
tion 498, 532.

Absorbing barrier 243. 

Additive process 391. 

Adjunction, extension of a process by
71.

Admissible Borel field 208. 


Backward equations: chain case 254, 
272; general purely discontinuous 
case 270, 273; diffusion case 274. 

BAIRE functions 600; generalized 613.

BESSEL’s inequality 152.

BOREL-CANTELLI lemma 104.

BOREL field 599.

BOREL measurable function 600.

BOREL set 600; generalized 613.

BROWNIAN movement process: defini- 
tion 97; conditions that a martingale 
be one 384; general discussion 392; 
condition that a process with inde- 
pendent increments be one 420. 


CAMPBELL’s theorem 433.

Card mixing 186. 

Centering constants 110; absolute 110. 

Centering function of a process with 
independent increments 407. 

Central limit theorem: sums of mutu- 
ally independent random variables 
137; Markov processes 221; martin- 
gales 383. 

CHAPMAN-KOLMOGOROV equation 88, 
235, 255.

Characteristic function of a distribu- 
tion 37; application to convergence 
of series of mutually independent 
random variables 115; application to 
central limit theorem 139. 


Closed linear manifold 75, 149. 

Complete measure 5, 606, 623. 

Completely additive set function 604. 

Conditional probabilities and expecta- 
tions: definition 15; conditional prob- 
ability distribution 26, 623 (wide 
sense 29); iterated 35; Gaussian case 
76; wide sense 155. 

Consequent: integer of a Markov chain 
176; set of a Markov process 206. 
Consistency of an estimation procedure
633.

Continuity properties of stochastic 
process sample functions: Markov 
chain 246, 248, 265; Markov proc- 
ess 258, 260, 266, 267, 388; martin- 
gale 361; process with independent 
increments 388, 420, 422; Brownian 
movement 393. 

Convergence: stochastic, in measure, 
in mean, with probability one 8; 
in distribution 9. 

Convex function of a semi-martingale 
or martingale 295. 

Convolution 78. 

Covariance function: general charac- 
terization 72; definition in stationary 
case 95 (multidimensional case 596); 
characterization in stationary case 
473, 518. 

Cyclically moving sets: Markov chain 
177; general state space 211. 

Cylinder set 600. 


D: Hypothesis 192. 

Density of a distribution 6. 

Derivative of a set function relative to 
a net 343, 612. 

Determined: set determined by condi- 
tions on specified random variables 
292. 

Deterministic process 564. 

Difference sets 511; field 511; manifold
512.




Differential process 391. 

Differentiation of sample functions: 
stationary process 535; process with 
stationary increments 558. 

Diffusion equations 275. 

Diffusion-type process 273. 

Distribution function 6; density 6; mul- 
tivariate 6. 

Dominated: process dominated by a 
semi-martingale 297. 


Ergodic classes of a Markov process: 
chain case 179; continuous state 
space 209, 210. 

Ergodic theorem: discrete parameter 
464; continuous parameter 515. 

Estimation of covariance and spectral 
distribution functions: discrete pa- 
rameter 493; continuous parameter 
531. 

Expectation of a random variable 8. 

Extension of a stochastic process by 
adjunction 71. 


Fair game 299. 

Favorable game 299. 

Field of sets 599. 

Filter 638. 

Fixed point of discontinuity of a sto- 
chastic process 357. 

FOKKER-PLANCK equation 275. 

Forward equations: chain case 254, 
272; general purely discontinuous 
case 270, 273; diffusion case 274. 

FOURIER series 150. 

Fourier transform of a process with 
orthogonal increments 434.

Function space type, process of 67. 

Fundamental theorem of sequential 
analysis 352. 


Gain of a linear operation on a sta-
tionary process 534.

Game of chance: system 145; fair, 
favorable game 299; invariance of 
fairness and favorableness under op- 
tional stopping 300, sampling 302, 
373, 376, skipping 309. 

Gaussian process: definition 71; cri-
terion for existence 72; conditional
expectations in one 390; conditions
that a process with independent in-
crements be one 420; metric transi-
tivity of 637.


Harmonic analysis of a stationary proc- 
ess 469, 517; (see also Spectral rep- 
resentation). 


Independent increments, process with 
(see Chapter VIII): definition 96; 
stationary increments 97, 512; sam- 
ple function continuity of 388, 422. 

Independent random variables (see
Chapter III): definition 7; processes 
with 78, 102. 

Infinitely divisible distribution 128. 

Integral with independent random ele- 
ments 391. 

Integration in infinitely many dimen- 
sions 342. 

Integration of sample functions 62; in 
a stationary process 538. 

Invariant random variables: under 
measure-preserving transformations 
(strict sense) 457, 610; of a station- 
ary Markov process 460; under iso- 
metric transformations (wide sense) 
463. 

Invariant set: of a Markov process 
206, 460; minimal invariant set of a 
Markov process 206; of measure- 
preserving transformations 457, 510. 

Isometric transformations 461; semi- 
group and group of 512. 


Jensen’s inequality for conditional
probabilities 33.

Jump 246.


Large numbers, law of: for strictly sta- 
tionary processes 95, 465, 515; defi- 
nition 122; for sums of independent 
random variables 123 (with a com- 
mon distribution 142, 341); for sums 
of orthogonal random variables 158; 
for Markov processes 218; for proc- 
esses with stationary independent in- 
crements 364; for wide sense sta- 
tionary processes 489, 529. 

Least squares approximation 76; linear 
77. 

LEBESGUE-STIELTJES measure 607. 

Likelihood ratio 93, 348. 




Linear manifold 149; closed 75. 

Linear operations on stationary proc-
esses: discrete parameter 500; con-
tinuous parameter 534.

Lower semi-martingale 294. 


Markov chain: discrete parameter 170; 
application to card mixing 186; con- 
tinuous parameter 235, 265, 271, 388. 

Markov process (see Chapters V and 
VI): definition 80 (wide sense 90); 
covariance function in wide sense 
case 233; (see also MARKOV chain, 
Stationary MARKOV process). 

Markov property 81. 

Markov transition function 255.

Markov transition matrix function 236. 

Martingale (see Chapter VII): defini- 
tion 91; wide sense 164; relative to 
specified Borel fields 294; defined by 
stochastic integrals 444. 

Measurability: of a stochastic process 
60; of sample functions 62. 

Measurable set on the sample space of 
specified random variables 19. 

Measure function 605; complete meas- 
ure 5, 606; probability measure 605; 
Lebesgue-Stieltjes measure 607. 

Measure-preserving point transforma- 
tions 452, 617; translation semi- 
group, group of 507. 

Measure-preserving set transformations 
452; translation semi-group, group 
of 507. 

Metrically transitive transformation 457 
(wide sense 463); stochastic process 
457 (wide sense 463); Markov proc- 
ess 460; process with independent 
random variables 460; process with 
orthogonal random variables 464; 
translations of [0,1] modulo one 508; 
process relative to the difference field 
511; process with stationary incre- 
ments 512; process with stationary 
(wide sense) orthogonal increments 
514. 

Minimal invariant set of a Markov 
process 206. 

Molecular distributions 404. 

Moving averages, process of: discrete 
parameter 498; finite average 504; 
continuous parameter 532. 




Moving point of discontinuity 357. 

Multidimensional prediction 594. 

Multiple Markov process 89; applica- 
tion to card mixing 186. 


Optional sampling: discrete parameter 
301; continuous parameter 366. 

Optional skipping 310. 

Optional stopping: discrete parameter 
300; continuous parameter 366. 

Orthogonal increments, process with 
(see Chapter IX): definition 99; 
metric transitivity of 514. 

Orthogonal random variables, processes 
with (see Chapter IV): definition 79. 

Orthogonality 74. 

Orthogonalization 151. 


Poisson process: definition 98; general 
discussion 398; application to molec- 
ular and stellar distributions 404. 

Polynomial approximation 562. 

Positive definite function: discrete argu- 
ment 473; continuous argument 519. 

Prediction (see Chapter XII): multiple 
Markov discrete parameter 506; by 
way of a stochastic differential equa- 
tion 550. 

Probability measure 605. 

Projection: definition 155; wide sense 
martingale limit theorems 166. 

Purely random events 400. 


q-bounded set 260, 265. 


Random events 400. 

Random variable: definition 5; on the 
sample space of specified random 
variables 19. 

Random walk 308. 

Rational spectral densities: (discrete
parameter) in e^{2πiλ} 501; in λ (con-
tinuous parameter) 542.

Reduction procedure 204. 

Reflection principle 393. 

Regular stochastic process 564. 

Representation of a family of random 
variables 12; applied to conditional 
expectations 33; detailed justification 
623. 




Sample functions: definition 11; meas- 
urability 22; integration 62; differen- 
tiation 535, 558. 

Sample space: definition 3; function or 
set measurable on the sample space 
of specified random variables 19. 

Semi-martingale (see Chapter VII): 
definition 292; relative to specified 
Borel fields 294. 

Separability of a stochastic process 52; 
relative to a specified class of sets 51. 

Sequential analysis, application of mar- 
tingale theory to: discrete parameter 
350; continuous parameter 380. 

Series: of mutually independent ran- 
dom variables 105, 335; three series 
theorem 111; Fourier series 150; of 
orthogonal random variables 155; of 
power series type 159. 

Set of increase of a singular set func-
tion 611.

Shift transformation: discrete param- 
eter measure-preserving case 455; 
discrete parameter isometric case 462; 
continuous parameter measure-pre- 
serving case 510; continuous param- 
eter isometric case 512. 

Singular set of a singular set function 
611. 

Singular set function 611; component 
of a set function 611. 

SMOLUCHOVSKI equation 88. 

Spectral decomposition of a stationary 
process: discrete parameter 486; con- 
tinuous parameter 529. 

Spectral density of a stationary process: 
discrete parameter 476; continuous 
parameter 522. 

Spectral distribution function of a sta- 
tionary process: discrete parameter 
476; continuous parameter 522. 

Spectral representation of a stationary 
process: discrete parameter 481; con- 
tinuous parameter 527. 

Spectrum of a stationary process 476. 

Standard extension of a stochastic proc-
ess 69.

Standard modification of a stochastic 
process 66. 

Standard pair of q-functions 265. 

Stationary independent increments, 
process with 419. 






Stationary Markov process (wide
sense) 437, 506, 523, 550, 566;
Gaussian case 218, 234, 506.

Stationary Markov transition function 
256. 

Stationary process (see Chapters X,
XI): definition (strict sense) 94,
(wide sense) 95, (multidimensional
wide sense) 596.

Stationary (wide sense) increments,
process with 99, 551.

Stellar distributions 404.

Step function 426; (t,ω) step function
438.

Stochastic difference equations 503.

Stochastic differential equations: dif-
fusion type 273; satisfied by a sta-
tionary process 546, 559.

Stochastic integral 62, 426, 436, 540.

Stochastic matrix 172.

Stochastic process (see individual types
under their own names): definition
46.

Stochastic transition function 190; den-
sity 193.

Stochastically definite process 625.

Strict sense concepts 77.


Temporally homogeneous process 96. 

Three series theorem 111. 

Transient state of a Markov chain 178; 
set of a Markov process 210. 

Translation group and semi-group of 
transformations: measure-preserving 
507; isometric 512. 


Uncorrelated increments, process with
(see Chapter IX): definition 99.

Uncorrelated random variables, process
with (see Chapter IV): definition 79.

Uniform integrability: of the random
variables of a semi-martingale 311;
definition 629.

Unitary transformation 461, 636, 637.

Upcrossings 315.


Weak sense concepts 77. 


Zero-one law 102; proof by martingale 
theory 334; continuous parameter 

